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METHODS OF POPULATING DATA STRUCTURES FOR USE LN 
EVOLUTIONARY SIMULATIONS 

CROSS-REFERENCE TO RELATED APPLICATIONS 
This appHcation is a continuation-in-part of USSN 09/416,837, filed on 
5 October 12, 1999. 

This application also claims priority to "METHODS FOR MAKING 
CHARACTER STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING 
DESIRED CHARACTERISTICS" by Selifonov et al, PCT Application filed on Jatiuaiy 18, 
1999 (filed by the Law Offices of Jonathan Alan Quite, attorney docket so: 02-289-3PCO) 
1 0 which is a continuation-in-part of "METHODS FOR MAKING CHARACTER STRINGS, 
POLyNUCLEOTlDES AND POLYPEPTIDES HAVING DESIRED 
CHARACTERISTICS" by Selifonov ei at., USSN 09/416,375, fiied October 12, 1999, 
which is a non provisiona! of "N4'ETH0DS FOR MAKING CHARACTER STRINGS, 
POLYTS-UCLEOTIDES AND POLYPEPTIDES HAVING DESIRED 
1 5 CHARACTERISTICS" by Selifonov and Stemmen USSN 60/1 1 6,447, filed Januar>' 19, 
1999 and which is also a non-provisjonal of "METHODS FOR MAKING CHARACTER 
STRINGS, POLYNUCLEOTIDES AND POLYPEPTIDES HAVING DESIRED 
CHARACTERISTICS" by Selifonov and Stemmer, USSN 60/1 J 8,854, filed February 5, 
1999. 

20 This application aiso claims priority to "OLIGONUCLEOTIDE MEDIATED 

NUCLEIC ACID RECOMBINATION' by Craitieri et at, PCT Application filed on January 
1 8, 1999 (filed by the Law Offices of Jonathan Alan C^ite, attorney docket no: 02-296-3PC) 
which is a continuation-in-part of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID 
RECOMBINATION" by Crameri et ai, USSN 09/408,392, filed September 28, 1999, which 

25 is a non-provisional of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID 

RECOMBINATION" by Crameri et aL, USSN 60/1 iS,8i3, filed February 5, 1999 and 
which is also a non-pro visional of "OLIGONUCLEOTIDE MEDIATED NUCLEIC ACID 
RECOMBINATION" by Crameri et aL, USSN 60/141,049, filed June 24, 1999, 
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This applicaUon is also related to "USE OF CODON VARIIKD 
OLIGONUCLEOTrOE SYXTlIHSrS FOR SYNIHETIC SHUFFLING" b\ Wek-h et al 
USSN 09/408,393, filed September 28. 1999. 

Ihc p :;&e'n apphw: o ^.^in-, j.^.f'- v to and herKilo cd^h tnese 
5 appbc il-ioris as p>ei\ 'tied tin anuef 35 I SC vf i 1 9 t,nd 'oi I SC M2v!, cSs appiopnate 
All ui these apphcdliun!> are incorporated heiem by icfeicnce m thcu entirety ior all 
putposes. 

COPYRIGHT STATEMEN I 
A portion of the disclosure of this patent document contains material which is 
10 mh^m to copyright protection. Tiie copyright owner has no objection to the facsimile 

jt eproducUon by any-one of the patent document or the patent disclosure, as it appears in the 
Patent and Trademark OfSce patent file or records, but otherwise reserves ail copyrightrights 
whatsoever. 

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY 
15 SPONSORED RESEARCH AND DEVELOPMENT 

j Not Applicable ] 

FIELD OF THE LN VENTION 
This invention relates to the field of computer modeling and simulations. In 
particular, this invention provides novel methods of populating data structures for use in 
20 evolutionary modeiing, 

BACKGROUND OF THE INVENTION 

There is aii extensive history of the use of computers to simulate and/or 
investigate the evolution of life, of individual genetic systems and/or population 
genetic/phenotypic systems. The motor propeUiug most artificial life (Alife) simulations is 
25 an algorithm which allows artificial creatures to evolve amVor adapt to their environmetit. 
The Itmdamenta! ;ilgoritb.nis fali itno I'.vo copvlii.^rrL ca-icgor-cs: ieamis'ig uJgorithms icg., 
aigorithn^s typified by neura! net work an<.i c\oiiit!Ofr;ry ai^gontbtus- typified, for cxrimple. 
by genetic aSgorithms. 

Many artificial life researchers, especially those concemed Vi^'ith higbei -order 
30 processes such as learning and adaptation, endow tlieir organisms with a neural net which 
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serves as an artificial brain {see, e.g., Toarctzky (1088-199i), Neural b\fonnatkm Pnxx^i^'smg 
Systems, volume 1-4. Morgan Kaufinann. 1988-1991 . Neural networks aie learning 
aigonlhmt. rhcy max bt trairico t *o '^■ /B3^;i into categories. A typical task is to 
recognt/e to \%h!ch letter a given Kanu-v^ritien chaiacK-i tuucs-ponU^ 
5 A neural net is cun)pos.c<,f ol'a coHectton of niput-outpat devices, called 

neurons, which are organized in a (highly connected) network. Nonnaliy the network is 
organized into layers: an input layer which receives sensory input, any ntjmber of so-calied 
hidden layers which perform the actual computations, and an output layer which repotls the 
, results of these computations. Training a neural netvsfork involves adjusting the strengths of 

1 0 the connections between the neurons in the net. 

The other major type of bioiogicaliy inspired fundamental algorithms are the 
evolutionary algorithms. While learning processes {e.g., nearal networks) are melaphoricaily 
based on learning processes in individual organisms, evolutionary algorithms are inspired by 
evolutionary change in populations of individuals. Relative to neural nets, evolutionary 

15 algorithms have only recently gained wide acceptance in academic and industrial circles. 
Evolutionary algorithms are generally iterative. An iteration is lypicaliy 
referred to as a "generation". The basic evoiutionar>' algorithm tradilionally begins witii a 
population of randomly chosen individuals. In each generation, the individuals "compete" 
among themselves to solve a posed problem, Individuals which perform relatively well are 

20 more likely to "survive" to the next generation. Those sun-iving to the next generation may 
be subject to a small, random motiifications. If tlie algorithm is correctly set up, and the 
problem is indeed one subject to solution in this manner, then as the iteration proceeds the 
population will contain solutions of increasing quality. 

The most popular evoiutionary algorithm is the genetic algorithm of J. 

25 Holland (J.H. Holland { 1 992) Adaptation in Natural and Artificial Systems, Univemty of 
Michigan Press 1975, Reprinied by MIT Press.). The genetic algorithm is widely used in 
practical contexts {e.g., ilnancial forecasting, management science, etc.). It is particulaiiy 
well-adapted to multivariate probieins whose solution space is discontinuous ("mgged") and 
poorly understood. To apply the genetic algorithm, one defines 1) a mapping from the set of 

30 parameter values into the set of (0-1) bit strings (eg, character strings), and 2) a mapping 
from bit strings into the reals, the so-called .fitness function. 

In most evolutionary algorithms, a set of randomly-chosen bit strings 
constitutes the initial population, hi the basic genetic algorithm^, a cycle is repeated during 
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v^hitb the httu of each mdividuai in the populatuu 5b evaluated, i.opie-> uf indn jJuaii. are 
madv." iji prupnitjon lo then fitness; and the lvc c rcoeaied The typn^ai srartmg potni Joi 
such evolutionarv algontl«Ti.s ]i!a sc. o-'i^nu ir] ^\v>t\-i ' u- '"he a'^e ut an 
'j!biiraj\" nmlom <n haphaZuiG <5t-u',,,; ^<";-u n u, ^trorgU hu-, tvoludonar^ 
5 aiJOiithif! a\^a^ ftoni on cfticir u Ci-^^ u Ut, o ^ 'j ■-oiiitjon In {he probit-ntat hand, 

pajtit.ulaji> wht'ie the ^dguiUhm .s u.scd lo - n„c, 05 u'utr^c a hsologicai histor>' 05 proce s 
Indeed the piily "force" drivuig the e\oiuuonar'/ alaonthm u> snv sokitjon whatboeves js a 
ijtiiess determination and associated selection pressure. While a solution may eventual!) be 
reached, because the process starts irom a raudoni {e.g. arbitrary) initial state in which the 
1 0 population members bear no relationship to each other, the population dynamics as the 
algorithm proceeds reveals little or no information reflecting the dynamics of the simulated 
system. 

In addition, cvohitionary algorithms are typically relatively high order 

simulations and provide populatioiv level iiiforrnation. Specific genetic informationj if it is 
15 presciit at all, typically exists as an abstract repriisentation of an allele (typically as a single 

character) or allele frequency. Consequently evolutionary algorithms provide little or no 

information regarding events on a molecular level. 

Similarly, neural nets and/or cellular automata, take as their starting point, 

essentially artificial comtructs and utilize interaai rules (algorithms) to approximate 
20 biological processes. As a consequence such models generally mimic processes or 

metaprocesses, but again afford little or no information or insight regarding events at the 

molecular level. 

SUMMARY OF THE INVENTION 

This invention provides novel methods of generdting "initial" populations 
25 suitable for fiirther computational manipulation, e.g. via genetic/evolutionary algorithms. 
The members of populations generated by the methods of this invention pos,sess varying 
degrees of "relaledness" or "similarity" to each other reflective of the degrees of covariance 

found m naturally occurring populauont. in addition, unlike the populaijou,^ used as input \n 
t\pit.al e\okitionar\ alyoruhms the nnpu'at ors jreneiated by the mcth(id,s p!Ovidi--d hetcui 
30 typically coruaiii d<.tat!cd intonn, .10 1 c-'^ol t .-i^ \ .lu; . Tiijfubti'i aud tlic uiUnmauuu is 

typicailv of suftlcieni complexity to provide a "contmuous" (rather tlun binar>) ineasuic of 
intermember variability and/or relatedness. Indeed the methods of this invention provide 
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dotaiicd coding of moieciiiar iiifomiauort in the individuals comprising tJie populations, 
created according to the methods of this invention. 

Thus, in one embodiment. Ibis invention provKica methods ofpopuiating u 
data structure with ic.g. gcncrauug a collection or iibiary oi") character string^; The method 
5 prefcrabiy involve i) encoding two or more a biological molecules into characier strings to 
provide a collection of two or more different initial character strings wherein each of said 
biological molecules comprisss at least about 10 subunits; ii) selecting at least two 
substrings from said character strings; iii) corxatenating said substrings to form one or 
more product strings about the same length as one or more of the initial character strings; jv) 

10 adding the product strings to a coUectioji of strings (a datastmcture); atid v) optionally 
repeating steps (i) or (ii) through (iv) using one or more of said product strings as an initial 
string in the collection of initial character strings. In particularly prefewed embodiments, the 
"encoding" comprises encoding one or more nucleic acid sequences and''or one or more 
amino acid sequences into the character strings. The nucleic acid and'or amino acid 

15 sequencevS can be unknown and/or haphazardly selected, but preferably encode known 
proieiij(s). in one preferred embodiment, biological molecaies are selected such that they 
have at least about 30%, preferably at least about 50%,. more preferably at leas! about 75%, 
and most preferably at least about 85%, 90%, or even 95% sequence identity with each other, 
in one embodiment, the substriiig{s) are selected such that the ends of the 

20 substrings occur in character string regions of about 3 to about 300, preferably about 6 to 
about 20, more preferably about 10 to about 100 and most preferably about 20 to about 50 
character that have higher sequence identity with the corresponding region of another of the 
initial character strings than the overall sequence identity between the same two strings. In 
another embodiment, the selecting can in volve selecting substrings such that the ends of said 

25 substrings occur in predefined motifs of about 4 to about 1 00: preferably from about 4 to 
about 50, even more preferably from about 4 to about 1 0, still more preferably from about 6 
to about 30 and most preferably from about 6 to about 20 characters. 

In one embodiment, the selecting and concatenating can comprises 
concatenating substrings from two different initial strings such that the concatenation occttrs 

30 in a region of about three to about twenty characters having higher sequence identity 
between two different initial strings than the overall sequence identity between the two 
different initial strings. The selecting can also comprise aligning two or more of said initial 
cfaaraeter strings to maximize pairwise identity between two or more substrings of the 
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character str ings, and seiecting a character tiiat is a member of an aligned pair for the end of 
one substring. 

In certain embodiments, the "adding" step invoivcs calculating the thcoretjcal 
PI. PK, molecular weight, hydrophobicity, secondary strucnire and/or other pjoperties of a 
5 protein encoded b>' the ciiiiracror firing. Iji one preferred eiiibudinienl, the product firings 
are added to the collection (datastructore) only if they have greater than 30%, preferably 
greater than 50%,. more preferably greater than 75% or 85% sequence identity with the 
initial strings. 

The method can fnitlier involve rafidomiy alteriiig one or more characters of 

1 0 the character strings. This can be accomplished according to a nnmber of methods inc iudiag, 
but not limited to introducing a random string into the initial string collection and/or utilizing 
a stochastic operator as described herein. In a particularly preferred embodiment, tfie 
operations described above are performed in a computer. 

in another embodiment, this invention provides a computer program product 

15 comprising computer code that i) encodes two or more a bioiogical molecules into character 
strings to provide a collection of two or more different initial character strings wherein each 
of said biological moiecuies comprises at least about ten subuirits; ii) selects at least two 
substrings from the character strings; iii) concaienates the substrings to form one or more 
product strings about the same length as one or aiore of the initial character strings; iv) adds 

20 the product strings to a collection of strings (/.e., populates a datastnicture); and v) 

optionally repeats steps (i) or (ii) through (iv) using one or more of the product strings as an 
initial string in the collection of initial character strings. In other words, the computer 
program product comprising computer code that performs tlie operations described herein. 
The program code can be provided in compiled form, as source code, m object code, as an 

25 executable, etc. The program can be provided on any convenient medium, e.g., magnetic 
media, optical media, electronic media, optomagnetic media, etc. The code can also be 
present on a computer, e.g. m memoi7 (dynamic or static raemoiy) on a hard di'ive, etc. 

In another embodiment, this invention provides a system for generating labels 
(tags) and'or music derived from the sequences of biological moiecuies. The system 

30 comprises an encoder for encoding two or more initial strings from biological moiecuies 
(e.g. nucleic acid and/or proteins); an isolator for identifying and selectmg s«b.strings from 
the two or more strings; a concatenator for concatenating die substrings; a data structure for 
storing the concatenated substrings as a collection of strings; a comparator for measuring the 
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number and/or variability of the coilection of strings and detennining thai sufficient strH^g^ 
exist in the eolJcction of strings; and a command writer for writing the collection ot strings 
itno a ran' string f:l:, n i .i; oodn ^e^it. :hj i^o'ator comprises a <jv:>mptirator foi 

aligiuRg and dvlemK-isn j -ctriC-n^ o^ iacr.tK> l■'c^^^ :o'. ' v o or more iniiial strinps, Sinisiarly 
5 the i;tinipdrat(»r m.ry cotnpi isc a nie.r-s ti ^aLu\ itng icqtieiice identity aiid tlic isolator and 
comparnlor maj (ipttonaHv ihcirt- this means. In prefenrcd embodiments, tlic isolator st-kcfs 
.substniii^.s such that the ends of said substrings occur in string regions of about tiiiee to about 
100 chai-acters tiialhave higher sequence identity with the corresponding region of another 
of the initial character strings tiian the overail sequence identity between the same two 
10 strings- 

In atiother embodiment, the isolator selects substrings such that the ends of 
said substrings occur in predefined motifs of about 4 to about 100, preferably from about 4 to 
about 50, even more preferably from about 4 to about \ 0, stiil mors prefei^bly from about 6 
to about 30 and most preferably from ^bout 6 to about 20 characters. 1« one einbodimeiU, 

15 the isolator and eoncatehator individually or in combination concatenate substrings from two 
different iriitiai strings such that the concatenation occurs in a region of about 3 to about 300, 
more preferably about 5 to about 2(){), most preferably from about 10 to about 100 characters 
having higher sequence identity between said two different initial strings than the overall 
sequence identity betweeii said two different iiiitial strings. In one prefentd implementatioh, 

20 the isolator ahgus two or more of the initiai character strings to maximize pairw-isc identity 
between two or more substrings of the character strings, and selects a character that is a 
member of an aligned pair for tlie end of one substring. 

The comparator can impose any of a wide varied of selection criteria. Thus, 
in various embodiments, the comparator can calculate theoretical f*I, PK, molecular weight, 

25 hydrophobicity, secondar\' structure and ur other properties of an encoded protein. In one " 
preferred embodiment ,dK coinparator adds strings to the data structure only if they have 
greater than i(t*o identuv vMth liic imtial strings. 

Ihc stem caii optumally comprising an operator that randamly alters oae or 
more chatdcteis of the ciuiai,tei strings In certain enibodiments, such an operator can 

.30 raiidonil} select and altei one oi muie occurrences of a particular preselected character in 
chaiactcr sumgb Preferred datastructures in this system stores encoded (or 
deconvolved) nucleic acid sequences aad/br encoded or deconvolved amino acid sequences. 



-7. 
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A iltrthcr understanding ofihe .n\efitiOfj ^^\. he jud torn tht; detailed 
discussion of specific embodiments belov, I c vi rrc^scs ot t.lic !'^', this dsbcu^sioti lefm to 
deuces, methods, and con lc?*^ U'n^o-svci) Houcu'i the nuihoU ot Ihc 

present mvenhon may opemu' s\ lb ^ b o t> pes ot logical de\ icvs !i is thcRfoio 
5 iritcnd<.d that the in\ eniion nnt hr ' ' nt f •> v. ■> as provided m the attached claims {ass 
mtetprctcd under thf d>ii,tniic ot < ^ ■inl\) 

Furthernioit^ u is recogni/ed that itniic s\Meni . ciiii luchuk tt ^\lde \aru ts of 
dittcient component!? and different functions m a modular fashion Uittctcm cmoodimcnte 
of a system can include different mixtures of demems and fimcttons and may group vanous 
10 functions as parts of vanous elenaents. For purposes of clarity, the mventton is descnbed m 
terms of systems that include many different innovative components and itmovative 
combinations of components. No inference should be taken to limit the invention to 
combinations containing all of the innovative components listed in any illustrative 
embodinient in tliis specification, 

15 DEFtNITIONS. 

The terms "character string" "word", "binary string" or "encoded string" 

represent any entity capable of storing sequence information (e.g. the subumt siruclurc of a 

biological molecule such as the nucleotide sequence of a nucleic 3cid, the amino acid 

sequence of a protein, the sugar sequence of a polysaccharide, etc.). In one embodiment, tiie 
20 character string can be a simple sequence of characters (letters, numbers, or other symbols) 

or it can be numeric representation of such information in tangible or intangible (e.g. 

electronic, magnetic, etc.) form. The character string need not be "linear", but can also exist 

in a number of other forms, e.g. a linked list, etc. 

A "character" when used in reference to a character of a character string refers 
25 to a subunit of the string. In a preferred embodiment, the character of a character string 

encodes one subunit of the encoded biological molecule. Thus, for example, in a preferred 

embodiment, where the encoded biological molecule is a protein, a character of the string 

encodes a single amino acid. 

A "motif" refers to a pattern of suhunits comprising a biological molecule. 
30 The motif can refer to a subutiit pattern of the uiiencoded biological molecule or to a subunit 

pattern of an encoded representation of a biological molecule. 
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The Unn substring refers to a suing that a, found ^viihn-' ziiovm ^tnnq Thv 
substring can mchide the uill length "parent" slnriL' bui 'vpscalis the ^uhsiring icptesents a 
substring of the titli- length siring. 

ihc K'lm "dalci suuLtuftf" . v <> ynx/?j)ox, and upuuaaliy d=;so( uucd 

5 device lor the st.ii.iiie of fntomidtiou, typicallv muihpte "pieccv" oi ujfonnation The ddU 
bttuctyre cjn he a simple lecordation of the tnformalion (e.g. a list) or the data btructure can 
coiitam additional uilormation (e.g. annotations) regarding the infomiation contained therein, 
can establish reJationships between the various "members" (information "pieces") of the data 
structijre, and can provide pointers or liiiked to resources external to tibe data structttre. The 

1 0 data stmcture can be intangible but is rendered tangibie when be stored/represented in 
tangible medium. The data structure can represent various information architcctuics 
including, but not limited to simple lists, linked H.sts, indexed lists, data tables, indexes, hasli 
indices, flat file databases, relational databases, local databases, distributed databases, thin 
client databases, a«d the like. In preferred enibodimetuft, ilie data staicture provides fjeids 

15 sutBcient for the storage of one or more character strings. The data stmcture is preferably 
organized to permit alignment of the character strings and, optionally, to store information 
reganiing the alignment and/or string similarities and/or string differences. In one 
embodiment this information is in the form of alignment "scores" (e.g., similarity indices) 
gind-'or alignmeni; maps showing individuai subunit (e,^. nucleotide in the case of nnckic 

20 acid) alignments. The term "encoded character string" refers a representation of a biological 
molecule that preserves desired sequence/structural infomiation regarding that molecule. 

Similarity, when used herein can refer to a similarity measurement between 
the encoded representation(s) of a molecule (e-g-., the initial character strings) or between the 
molecules represented by the encoded character strings. 

25 Wlien referring to operations on strings (e,g. insertions, deletions, 

transformations, etc.) it wili be appreciated that the operation can be performed on the 
encoded representation of a biological molecule or on the "molecule" prior to encoding so 
lhat the encoded representation captures the operation. 

Tlie tenn "snbunit" when used in reference to a biological molecule refers to 

30 the characteristic "monomer" of which a biological is composed. Thus, for example, the 
subunit of a nucleic acid is a nucleotide, the subunit of a polj^peptide is an amino acid, the 
subimit of a poiysacchaiide is a sugar, eic. 
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The Xsrnis "pool" or "coliection" are used interchangeably when used to refer 

to strings, 

A "btoKiETici' ^ t cch -oic-v ui a .rc coi. o tvpscaliv tbund in <i bioliijuoai 
oigdiiiMii I'lckucd bioUigtcal inuiecuies .a-o -^uiioizicil m i^n'^moleoules that ari.' 
5 typiLdUy polvnttjric m nature being composed of multiple ^ubumK T^'pJcJ biolrfgicai 
molecules mciudc, but are not limited to nucleic acids (formed of nucleotide subuntts) 
proteins (formed of amino acid subumts), polysaccharides ( formed of sugar subunits), etc. 

The phrase "encoding a biological molecule" refers to the generation of a 
representation of that biological molecule that preferably contains and can therefore be used 
10 to recreate the information content of tite original biological molecule. 

The term "nucieic acid" refers to a deoxyribonucleotide or ribomicleotide 
polymer in either single- or double-stranded form, and unless otherwise limited, 
encompasses known analogs of natural nucleotides that can function in a similar manner as 
naturally occitrring nucleotides. 
15 A "nucleic acid sequence" refers lo the order and identity of the nucteotidea 

comprising a nucleic acid. 

The terms "polypeptide", "peptide" and "protein" are used intercimngeably 
herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers 
in which one or more amino acid residue is an artificial chemical analogue of a 
20 corresponding naturally occurring amino acid, as well as to naturally occmrmg amino acid 
polymers. 

A "polypeptide sequence" refers to the order and identitj^ of the amino acids 
comprising a polypeptide. 

llie phrase "adding the product strings to a collection of strings" as used 

25 herein does not require a mathematical addition. Rather it refers to a process of identifyiag 
one or more strings as included within a set of strings. This can be accomplished by a 
variety of means inckidmg, btit not limited to copying or moving the string(s) in question 
into a data stracatre that is a collection of strings, .setting or providing a pointer from the 
string to a data structure that represents a collection of strings, setting a flag associated with 

30 the vString indicating its inclusion in a particular set, or simply designating a rule that the 
string(s) so produced are included in the collection. 
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BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 illustnites a flow chart depicting one embodiment of the methods of 

this invention. 

Figure 2 il lustrates a select ion and concatenation of subsequences according 
5 to the method(s) of this invention. 

Figure 3 illustrates a selection and concatenation of subsequences according 
to the method(s) of this invention where the concatenation utilizes an alignment algorithtn to 
fix the order of substrings. 

Figure 4 illustrates a representational digital device 700 according to tire 
1 0 present invention . 

Figure 5 is a chart and relational tree showing percent similarity for different 
subtilising (an exemplar set of initial chai-actev strings ). 

Figure 6 is a pairwise dot-ploS alignment showiiig homology areas for 
different subtilisiiis. 

1 5 Figure 7 is a pairvvise dot-plot alignment showing homology areas for seven 

different parental sufatilisins. 

DETAILED DESCmPTION 

h Generating populations of clisuacter strings. 

This invention provides novei computational methods So generaSe 

20 representations of actual or theoretical populations of entities suitable tor use as initial (or 
mature/processed) populattoiis in evoiutionarj' models more preferably in evolutionary 
models typified by genetic algorithms. When initialized to reflect features of particular 
biological organisms, the entities generated by the metliods of this invention each contain 
si^ifioant infottaation regardmg tmderlying inolecular biology (eg. representative amino 

25 acid or nucleic acid seqtience(s)) and thereby permit models based on genetic or other 

algorithms to provide unprecedented levet so information regarding evolutionary processes 
at tiie molecular level. 

in particularly preferred embodiments, the methods of this invention generate 
populations of character strings where each character string represent.s one or more- 

30 biological molecules. Using only a few strings as "seeds" the metliods generate large 

populations of strings bearing an "evolutionary" relationship to the initial seed members. In 
.11- 
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contrast io vradiuonal genetic algorithms in wliich initial member sets are arbitrary, 
randorn/haptiazaid, or selected for maihematical or representaHonal convenience, tl^e 
populations generated by the methods of this invention are, in preferred embodiments, 
derived from known existing biologiCiu ' pre-cursors" (e.g.. particuiar nucieic acid sequences 
and/or poiypepfide sequeTices k 

111 a pteierred eiiibodiniciu, ihe nieiitods ofthi.s invention involve: 

1 ) Ideniifying/selecting two or more biological moiecuies; 

2) Encoding the biologicai moiecuies into character strings; 

3) Selectkig at least two stibstrings from the character stritigs; 

4) eoncalenating said substrings to form otie or mote product strings 
about the same length as one or more of the initial character strings; 

5) Adding the product strings to a collection of strings which can be the 
set of initial strings or a separate set; and 

6) Optionally introducing additional yariation into the a resulting string 
set; 

?) Optionall.Y adding selection pressure io the reauiting string set. 

8) Optionaiiy repeating steps (2) or (3) tbrough (7) using one or more of 

the product striiigs as an initial string in the collection of initial 

character strings. 
Bach of these operations is described in more detail below. 

IL Encoding one or more h>oto; 5 ;ica1 molecules into character stdngs. 

The methods of this invention typically utihze one or more "seed" members. 
The "seed" members are preferably representations of one or more biological moiecuies. 
Thus, the initial steps of preferred embodiments of this invention involve selecting t^'o or 
more biologicai moiecuies and encoding the biologicai molecules into one or more character 
strings. 

A IdeaUfymg/seteetmg "seed/imtia bioiogical moleculefs). 

Virtually any biologicai molecule can be used in tlie methods of this 
invention. However, preferred bioiogical molecules are "polymeric" biological 
macromolecules comprising a multiplicity of "subunits". Biological macromolecules 
particuiariy well suited to the methods of this invention include, but are not limited to nucleic 
-12- 
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acids (eg ON A, IINA, etc ),pioteii3b, glycoproteins, carboiiy^iraies, poiysacchandes,, certaia 

tatty acids, and the like. 

When nucleic acids are selected, the nuclesc acjd can be suigie stranded or 

5 uptcscrtt Cin ode T double, \nJ^. ^ k i. e ^ i vid'- aro paferJiK Lnov^ii 

nu(.k)c cii.\ds ^udi nucieu acid •-t-quci w^n leacib' determined from x number of 
iourt.os inciudine, but not nniTied to puhbc d<iU'bascs {c ? , GenBani), propnctar>' databases 
(e.g. Incyte databases), scientific publications, conii-nerciai or private sequencing 
laboratories, m-house sequencing labotatoriesvefc 

1 0 The imcieie acids can include genomic nucJeic acids, cDNAs, mJRNAs, 

artificial sequences, natural sequences having modified nucieotides, and tlie like. 

In one preferred embodiment, the two or more biological molecules are 
"related", but not identical Thus, the nucleic acids may represent the same gene or genes 
but differ in the strain, species, genus, family, order, phylum or kingdom from which they 

iS are derived. Siiiiilarly, in one embodiment, the protein, polysaccharide, or other molecuIe{S) 
are the same protein, polysaccharide, or other mokcule(s) with differences between the 
molecules resulting from ihe lact that they are selected from different strains, species, genus, 
families, orders, phyla or kingdoms. 

ITie biological mo-iecules can represent a single gene product (e.j?. an mRM A, 

20 a cDNA, a protein, etc. ) or they can represent a collection of gene products and^or non- 
coding nucleic acids, in certain preferred embodiments, the biological molecules will 
i-epresent members of one or more particular metabolic pathways (e.g. regulatory, signaling 
or syntlietic patliways). Thus, for example, the biological molecules can include members 
coitiprisihg an entire operon , or a complete biosyntlmic pathway (e.g. , the lac operon, 

25 Protein: B-DNA gal operon, the colicin A operon, the hx opcron, polyketide synthesis 
pathways, etc.). 

in centiin prcfencd ombodimcms, the bioiogtcal molcculct, can include sny 

number uf different, <^Qm^. piotcm^. eu I n .s, m .d\r\ embodunents, liic biuiugical 

molecules could include the total nucleic ) J - ;o DNA, cDXA, or niRNAj oi 

30 total protein, or total lipid, eic, of an mdivKiaal, or muitipic individuals of the same or 

different species. 

In certain embodiments, the biological molecules can reflect a 

"representation" of tfie total population of that species' molecules. High order representation 
-13- 
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of populations of molecules is accompUshed in the labotatory and. according to the methods 
of this invention can be pcrfonncd m sUico. Methods of rcpre.seming complex molecules or 
popitiations of molecules are seen in Representationftl Difference Anfilysis (RDA) and 
related techiiiques {see, e.g.. Lisiisyn (.1995) Trends CeneU, 11(8): 303-307, Risinger eial 
3 {\99A) Mol CarcinGg. \ 1(1): 13 -18, and Mkhiets etaL {199%) Nucleic Acids Res. 26: 15 
3608-3610, and references cited therein). 

Particular preferred bioiogical molecuies for encoding and manipulation in 
the methods of tiiis invention include proteins and/or tlie nucleic acids encoding the proteins 
of various classes of molecules such as therapeutic proteins such as erythropoietin <EPO), 

10 insulin, peptide hormones such as human growth hormone; growth factors and cytokines 
such as Neutrophil activating peptide-78, GROa'MGSA, Crop, GROy, MlP-la, MIP-16, 
MCP-l, epidenml gro'vvlh factor, fibroblast growth fijctor, hepatocyhte growth factor, 
insulin-like growth factor, tlie interferons, the interieukins, kcratinocyte growth factor, 
leukemia inhibitor)' factor , oncostatin M, PD-ECSF, PDGF, pieiotiopin, SCF, c-kit ligand, 

15 angiogenesis factors {e.g. vascular eisdothelia! growth factore VEGF-A, VEGF-B, VEGF-C, 
VEGF-D, placental growth factor (PLGF), etc)., growth factors {e.g. G-CSF, GM-CSF), 
soluble receptors {e.g. IL4R, IL-13R, IL-iOR, soluble T-cell receptors, etc), and tlie like. 

Other preferred moieculcs of encoding include, but are not limited to 
transcription and expression activators. The transcription and expression activators include 

20 genes and/or proteins that modulate ceil growth, differentiation, regulation and the like an 
are found in prokaryotes ,viruses, and eukaryotes including fungi, plants and animals. 
Expression activators inchide, but are not Uratted to cytokines, inflammatory molecules, 
gtowth factors, grow (h factoi receptors, and cnicoacnc produets, jntei leukins {c g IL- 1 , IL- 
1,\L-Kctc nnttifcrons. F(jF, lG^-1. KiF-U. IrF PLXH . \n\ ,lGi-a. \\\)r-\\ FGk. iCCJF, 

25 SCR 'c-kit, CD40L GD40 \i v^: \ { \\\ , \< \v l f V-l. .mJ hvalm!n''CD44, sjynal 
tiansduction molecules and corrcsponJuig ..njo^.:t;:io p. jd.!cts, t c Mos, fL'VS, Ral, and 
Met, aid JTdnscnptionai act' s'titors and suprcs.sor-^, e , p5 ^. '! af. Fo^. Mv*.. Juu. Myb, Rcl, 
and steroid hormone receptors such as those for estrogen, progc^tcionc, testosterone, 
aldosterone, the LDL receptor ligand and corticosterone. 

30 Prefen-ed molecules for encoding in the methods of this invention also include 

proteins from infections or othCI^v^se pathogenic organisms e.g. proteins characteristic of 
Asper0ltissp„ Candida sp,, c<)Ii, Siaphyloccoi sp, Sirepiocci sp., Clostridia sp., Nms.seria 
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.^p , Lfite^oocK (i^nut. asp //i/ufiv "-j ♦ '^!>('sp ( upxli '^ai.'cf sp i'sachmoiids :>p , 
Lnapl I'nia Sy , Leg.ofielhi Spitviht' ■> \' iucchiiayi ■tiiiiom<.t^\p \'<>cafJhJ 
.\p , Chlamydia }-p , Rickettsia sp., Coxiella v h r^> i>^h . -p , Rot huliinai. a tf ucclla 
Yersinia, h, -K's^lfu, and PasiurelkK protozoa, \ uuse-'i ( - )ilN \ \ uuse^.l,-) RNA s irusts. 
5 Orthoinvxovifuses. dsDNA viruses, rctroviucs, etc. 

Stili other suitable molecules include nucleic acid and/or proteins that act as 
inhibitors of transcription, toxins of crop pests, industrialiy important eozymes {e.g. 
proteases, nucleases, and lipases) etc. 

Preferred moiecules include members of related "families" of nucleic adds or 

10 their encoded proteins, Relatedtiess {e.g. inchision or excUision from the "famiiy") can be 
determined by protein function and/or by sequence identity with other members of the 
iaraily. Sequence identity can be determined as described }>erein arid preferred family 
members share at least about 30% sequence identit\', more preferably at least about 50% 
sequence identity and most preferably at ieast about 80% st-quence identity. In certain 

15 instances, it is desirable io include molecules that have low {e.g. less thaji about 30%) 
sequence identity), but significant relaiedness. Such methods are well known in the 
bioinformatics literature and typically involve incorporation of molecular folding patterns 
with sequence/similarity information. One common impiementation of such an approach 
includes "threading algorithms*'. Threading algorithms detect remote homology by 

20 comparing sequences to structural templates. If the structural similarity between target and 
tcmpiatc is sufficiently large, their relationship can be detected in the absence of significant 
sequence similarity. Threading algorithms are well know to those of skill in the art and can 
be found, for example, in the NCBI Structure Oroup Threading Package {available from the 
National Center for Biological Information {see, e.g., http;//www.ncbi.nlm,nih.gov/ 

25 Structure/RESBARCH/threading.html) and in SeqFold (Molecular Simulations, Inc.). 

B) Encoding the btologtcal molecale into a charaeter string. 

The bioiugual itjolecuki^s) aic encoded mto charactei stnngi. In iIk sunpicbt 
uisiv.r>.v, tbo vhar^i^t^T stiing is ru il o ux' v iv'^icfor code used to repiescit the buJujivdl 
ir.o!ct.ulc ihus, im example, the charactcs st 'hl t.an tomprisc the charactct^ A L G, I ui 
30 U uheje a nucleic acid is encoded Simildrl\ die sia.idard anano at id noinent-latute can be 
used to represent a poiypeptide sequence. Akernativeiy, it wdl be realized that, to some 
extent, the encoding scheme arbitrary'. Thus, for example in the case of nucleic acids the A, 
-15- 
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C, o, i . O'- H can he repicj>ciited b\ the ntegtr-v ' 2, 1, 4 aid 5, tC'^pc4.J(\d> ?rd tht- nui-ieio 
dck^ can be . eprcst'ined .iri btnng of these mtegc s wiudi j-^ ilscM';' wtiglc ^^aibcit f>picdn'v 
large) itUeger Otbet coding ?jhcmcs are aiso possible. For cKJirpIs.. ina b!>!lo;iicaf 
molecule can he cucrdt-d nito a character string where each "subunii" nf the moiccuk- is 
5 encoded aito a njulti-characicr icpresentation. Alternatively variotJs coffipresseci 

representations are also possible (e.g., where recurrent motifs are represented only once with 
appropriate pointers identifying each occurrence). 

The biological molecules also need not be encoded into data structures that 
are discrete/single strings. More complicated data structures (e.g. arrays, linked li.sts, 

1 0 indexed structures including, but not limited to databases or data tables, etc.) can also be 
used to encode the biological moiecuic(s). 

Essentially any data structure capable that pennits input, storage, and retrieval 
of a representation of the biolopcal moiccu{c(s) is suitable. While these operations can be 
accomplished manually (e.g. with pencil and paper or card-file, elc), preferred data 

15 stnictures are data structures that can be manipuiaied optically and/or electronically and/or 
magnetically and thus permit automated input, storage and output operations (e.g., by a 
computer). 

OL Selecting snbstrin^s. 

in a preferred embodiment, the character string encoded biological molectiies 

20 provide an initial population of strings from which substrings are selected. Typically at least 
two, substrings are selected with one substring coming from each initial character string. 
Where there are more than two initial character strings, it is not necessary that every initial 
character string provide a substring as iong as at least two initial character strings provide 
such substring.?. In preferred embodiments, however, at least one substring will be selected 

25 from each initial string. 

Sabstriiig iength. 

There is essentially no limit on the maximiun number of substrings that can 
be selected from the initial strings other than the theoretical maximum number of strings that 
caii be generated from any given string. Thus, for example, the maxiraum number of 
30 substrings ."selected from an initial string is the number of string.? generated by a complete 
permutation of tlie initial string(s). 
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With an fmt!a5 i,ti r.g of iclativcly modest latgtii howo\t.-E, the rimiber ot 
pennulatioiiN ib quuc hiuh Jbu-, in pR-rcrred embudiracnts. 'hv' subvlriugs jrc sdcxicd iiom 
an mitiai stnrig such tha; iht; ^ ibv.-' ng>- do roi o^ ori.p Ivrcs-^^a .irot v*- ^vav. in a 
preferred <. mbodiment, the substrings from any one mitial stnng arc selected s>ucb that thosi; 
5 subsinngfe, if bgatcd in the correct order, y.onU reproduce the complete initiai string from 
which they are selected. 

Preferred substrings are also selected so as to not be unduly short. T>'pical!y a 
substring wiH be no shorter than the minim string length necessary to represent one sufaunit 
of tiie encoded biological molecule. Thus, for example, where the encoded biologicai 

10 molecule is a nucleic acid the substring will be long enough to at least encode one 

nucleotide. Similarly, where the encoded biological molecule is a polypeptide the substring 
will be long enough to at least encode one amino acid. 

In prefen-ed embodiments, the selected substring encodes at least two, 
preferably at least 4, more preferably at least 10, siil! suore preferabiy at least 20, and most 

15 preferably at least 50, 100, 500, or iOOO subunits of the encoded biological molecule. 
Substring length can be chosen to capture a particular level of biological 
organization. For example, a substring can be selected that encodes an entire gene, cDNA, 
mRNA, At a "higher" level of organization, a substring can be selected that encodes a series 
of related genes cDNAs, mRNAs, etc, as might be found in an operon, or a regulatory or 

20 sjTithetic pathway. At a still "higher" level of organization, the substring can be selected tiiat 
encodes the total nucleic acid (eg. genomic DNA, total mRNA, total cDNA) of an 
individual. There is essentially no limit to the "level of organization" that is captured in the 
substrings) as long as the initiai string from which the substring is selected encodes a higher 
level of organization, Th«.s, where the 8ubstring(s) are selected to encode individual genes, 

25 the initial string may encode entire metabolic pathways. Where the substring is selected to 
encode an individual's total nucleic acid, the initial string may encode the total nucleic acid 
of a population, etc. 

Conversely, the substring can also be selected to encode a subunit of a 
particular level of biological organization. Thus, for example, a substring can be used to 

30 select a particular domain of a protein, a particular region of a chrotnosome (e.g. , a region 
characteristically amplified, deleted or translocated), etc. 
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B) Substring sekxtimi algorithjBis. 

Any of a wide variety of approaches can be used to selected iJbe sttbstring(s}, 
the particular approach being detennined by the problem that beiag modeied. Preferred 
selection approaches include, but are not limited to random substring selection, uniform 
5 substring selection, motif-based selection, alignment-based selection, and frequency*biased 
selection. Tbe same substring selection method need not be applied to every initial chajracter 
string, but rather diiferent substtiiig selection methods caii be used for different initial 
strings. In addition, it is possible to apply multipie substring selection metiwds to any initial 
character string, 

10 L H aadom snbstrme selection. 

In one simple approach, the substring(s) can be selected randomly. Many 
approaches are available for the "random" selection of substrings. For example, where a 
siibstri ng(s) of minimum lengtli "L" are to be selected froni an ericoded character string of 
length "M", "cleavage points" can be selected using a random number generator producing 

15 integers (indicating position along the string} ranging from I. to M-l. (to avoid short terminal 
strings). "lotenial" substrings of length less than L are discarded. 

In another approach each position along the character string is addressed (e,g. 
by an integer ranging from i to N where N is the length of the character string). A minimum 
substring length "L" and a maximum substring length "M" are selected. Then a random 

20 number generator is used to generate a number "V" ranging from L to M, The algorithm 
then selects a substring from poshion I to V and position V4-1 become position 1 a^in. The 
process is then repeated ttntil the initial string is spanned. 

Oilier methods of randomly selecting substrings are readily devised. For the 
purpose of this invention, "random" selection does not require that the selection process meet 

25 formal statistical requirements for randomness. Pseudorandom or haphazard selection is 
suftlcieot in this context. 

2, Uniform substrin«j selection. 

In unitorm substring selection, tbe desired number of substrings to be 
obtained from each initial string is determined. The initial string is then uniformly divided 
30 into the desired number of substrings. Where the initial string length does not permit 
uniform division one or more shorter or longer substrings may be permitted. 

-18- 
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3. Motif-base d sekcdoii 

SLibsiringi can be scicticd from the inuial strings using molif-bascd scicction. 
In this apprpacb; ihe inifisl characier string(s) are scanned for the occurrctice of particitlar 
preselected motifs. Tht; substring is then selected such that tiie eridpoint(s) of the substring 
5 occur in a predefined relationship to the motif, Thus» for example, the eud can be within 
motif or "upstream" or "downstream" a preselected number of subunits from the end of the 
motifv 

The motif cart be completely arbitrary or it caii refJect the properties of si 
physical agent or biological molecule. Thus, for example, where the encoded biological 
10 molecule is a nucleic acid, the motif can be selected to reflect the binding specificity of a 
rcstnctioii endonuw,icasu k <f lv.oR\ HindlU BdrnTlT PmjTI c/r a piott.iii bnuimg site, a 
particular mnon'cxor. lUKtion d ti in>>po^ n -i u ) \^ n'.h v s liut. tiit tacoded 
biological molecule is a proiem, the motif can reflect a piotea.se bsndmg site, a piotesn 
binding site, a receptor binding site, a particular hgand, a complementanty determming 
1 5 region, an epitope, etc. 

Similarly, polysaccharides are can contain partitailar sugar motifs, 
glycoproteins can have particiiiar sugar motifs and/or particular amino acid motifs, etc. 

Motifs need not specifically reflect primary stmoture of the encoded 
biological molecule. Secondary and tertiary stnicture motifs are also possible and can be 
20 used to delineate substring endpoints. Thus, for example, an encoded protein may contain a 
characteristic a-heb"x, P-sheet, a-helix, motif and the occurrence of this motif can be used to 
delineate siib.string endpoints. 

Another "higher order" kind of motif can a ''meta-motif e.g.. as represented 
by a "fragmentation digest." In this approach, a substring eodpoint is not detennined by the 
25 occurrence of a single motif, but is delineated by coordinated pattern and spacing of one or 
more moiifs. 

Motifs can also be selected/utilized that do not strictly reflect seqtience 
patterns, bat ratlier the infonnation content of particular domains of the character strings. 
Thus, for example, U.S. Patent 5,867,402 describes a computer system and computation 
30 metliod for processing sequence signals by a Iransfonnation into an infonnation content 
weight matrix, as represented by Ri(b,l), A .second transformation foliows which apphes a 
particular sequence signal to the information content weight matrix, Ri{h,l) thereby 
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producing a \a!ufc, R., which comprises the .la v. a^j! ,'-iio"miUion content o* apariicular 
sequence signal Other .ipptoacbos to lb.' oeTC'^'v,^.,:^..n ot information ^.-onfenf of character 
strings are ;\ko ksiuwii ( ^.-lv ulsv StiJe\, - i'-^'^-i* Wk ice A.'icJs Rv;,, 12, 505-5 H': Schneidci 
(1994) Sar.oiecknohgy^ 1-8; Hennan e/a/. (1992) J. BacicnuL pp. 3558-3560; Schneidei 
5 ctul.{lWn Nucleic Acids Res., 18(20): 6097-6100; Berg, c/ (1988) J. Mol. Biol., 
200(4); 709-723). 

Other motifs that are contemplated reflect biological sigaals. Thus, for 
example, one motif delineating the end of a sabstring might be a stop codon, or a start codon 
in the case of an encoded nucleic acid, a methionine, or a polj^denylatioa signal in tlie case 
10 of a protein, efc. 

The same motif need not be applied to every initial sequence. In addition, 
multiple motifs, meta-motifs and/or motif/'raeta-raotlf combinations can be applied to any 
sequence, 

4. Aligaiaent-based selection. 

15 In anoihci^ approach substrings are selected by atigniag two or more initial 

character strings and choosing regions of higii identity between the initial strings in which to 
select the endpoints of the substrmg(s). Thus, for example, after a sequence alignment, 
substrings may be chosen such that the endpoint of tlie substr!ng(s) occurs within (e.g., in the 
middle of) a region having at least 30%, preferably at least 50%, more preferably at least 

20 70%, still more preferably at least 80%, and most preferably at least 85%, 90%, 95%, or 
even at least 99% sequence identity over a window ranging in length jSx)m at least about 5, 
preferably from at least about 10, more preferably from at least about 20, still more 
preferably from at least about 30, and most preferably from at least about 50, 1 00, 200, 500,, 
or even 1000 subunite. 

25 The terms "sequence identity" or "percent seqnence identity" or "percent 

identity," or percent "homology" in the context of two or more biological macromoiecules 
(i> ^ nutieit at ids oi pohufpuJe-^'* ciei to tv»-o or moje sequences or subsetjuences. that arc 
rhe vim: or n r 1. i s;k^ ' i u o' ^ nits't ^ a rii' ^ .\.d 'Csuiue^ ot nuckofac^) 

tikit a^e tht. same v\tteiUv)!npar(.a auJ jisjned \>£ nia\inuim (.oiiesponJcuv,e a-. nit.a->uicJ 

30 using one a sequence comparison algorithms or bv visxiai inspection. 

For sequence comparison, typically one sequence acts as a rekrence 
sequence, to which test sequences are compared. In a preferred embodiment, when using a 
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sequence comparison algori&in, test and reference sequences are input Into a computer, 
subsequence coordinates arc designated, if necessary, and sequence algorithm program 
parameters arc designated. The sequence comparison algorithm then calculates the percent 
sequence identity tor the test seqnence(s) relative to the reference sequence, based on the 
5 designated program parameters. 

AHgmnent and sequence comparison algorithms are well kmmx to tliose of 
skill in the art. For example, optima! alignment of sequences for comparison can be 
algorithms including, but not limited to the local homology algorithm of Smith & Waterman 
(1981) Adv. Apple. Matk 2:482, the homology alignment algorithm of Needle man & Wench 

10 (1970) ,/. Mol. Biol. 48:443, by the search for similarity method of Pearson & Lipan (1988) 
Proc, Natl. Acad. Sic. USA 85:2444, by computerized implementations of these algorithms 
(e.g., GAP, BESTFIT, FASTA, and TFASTA) in commercial modules and/br commercial 
software packages (e.g., the Wisconsin Genetics Software Package, Genetics Computer 
Group, 575 Science Dr., Madison, Wi), or by visual inspection {see generally Amusable ei 

15 aL. supm). 

One example of a useful algorithm is PILEUP. FILEUP creates a oiultipie 
sequence alignment from a group of related sequences using progressive, paii-wise 
alignments to show relationship and percent sequence identity. It also plots a tree or 
endogamy showing the clustering relationships used to create the alignment FILEUP uses a 

20 simplification of the progressive alignment method of Feng & DooUttie (1 987) J. MoL Evoi 
35:351-360. llie method used is similar to the method described by Higgins & Sharp (1 989) 
CAB/OS 5: 15 1- ! 53, The program can align up to 300 sequences, each of a maximum length 
of 5,000 nucleotides or amino acids. The multiple aiigmnent procedure begins with the 
pairwtse alignn ent of the two most similai sequence-, pioducing a cluster of two augned 

25 '-equeni.es Thisdustt !•> then aligned to the next most related sequence or cluster of 
aligned sequences 1 wo v.iusttrs of sequences jrc aligned b> a snnple c ctcnsion of ihe 
paiiwisc aiignnieni of two inJn idual seqtitn^cs 1 he fin il aimnmcnt is aLhie% ed h\ a series 
0} prowt'iMvc panwisc alignments The program is nm b> de ignaling spccifa equ^nct-^ 
uid th(.ii ammo acid or nucleotide coordinates fctr regions of sequence compari^Mi ma hs 

30 designamig the piogram paiamcte For cvample a ictt-tcnce sequence ca i be compared to 
otlier test sequences to determine the percent sequence identity relationship using the 
following parameters: default gap weight (3.00), default gap length weight (0.10), and 
weighted end gaps. 
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Aiiothti tXi'mpk of algoiitlim thai is sujtabic fo' dtieiniinmg percent 
bcqueiice laeiituv s;, Jiiencc siinilanty is ihe BLAST algmrlim, which is described in 
Ahschnl eta! (J990} J \u,( Biol 215 -.'V^ 4 sc J^vx;. to; pLilonuui^ BLAST analyses 
jspubljci; tnailal'iC thuugh the NvUso-u ( c-ntf sv B o\v.hnol(>j*v inlomtauon 
5 (http 'iv^'ww nci?i n)u5 oih iios ) This lUorr.iri r - o foM KlL-fiUfMiig high <5i.u!tiig 

sequeact; pair^ (HSPs) b> iauitif>in,4i short words uf length VV m the qucn sequcnoe, which 
either match or satisfy -^oitil positive valued threshold score T when aiisued with a woid of 
the sanie length ui a database sequence. I" js referred to as the neighborhood word score 
threshold (Aitschtil et ak supra). These initial neighfaorhood word hits act as seeds for 

10 initiatiiig searches to fmd longer HSPs contammg them. The word hits are then extended m 
both directions along each sequence for as far as the cumulative alignment score can be 
increased. Extension of the word hits in each direction are halted when: the ciimulative 
alignment score falls offby the quantity X from its raaximuitj achieved value; tlie cumuktive 
score goes to zero or below, due to the accumulation of one or more negative-scoring residue 

1 5 alignments; or tlw end of either sequence is reached. The BLAST" algorithm parameters W, 
T, and X determine the sensitivity and speed of the alignment. The BLAST program uses as 
defaults a wordlength (W) of 1 1 , the BLOSUM62 scoring matrix (nee Henikoff & Henifcoff 
(1 989) Proc. Nad. Acad. Sd. USA 89: 1091 5 } alignments (B) of 50, expectation (E) of 10, 
M=5, N~-4, and a comparison of both strands. 

20 In addition to calculating percent sequence identity, the BLAST algorithm 

also pertbnm a statistical analysis of the similarity between two sequences {see, e.g., Karlin 
& Ahschul (1993) Proc. NaiL Acad ScL USA 90:5873-5787). One measure of similarity 
provided by the BLAST algoritlun is tiie smallest sum probability (P(N)), which provides an 
iiidication of the probability by which a match between two nucleotide or amino acid 

25 sequences would occur by chance. For example, a nucleic acid is considered similai' to a 
reference sequence if tite smallest sum pix)bability in a comparison of the test nucleic acid to 
the reference nucleic acid is less than about OJ, more preferably less than about 0,01, and 
most preferably 1es.s than about 0.001 , 

The above-identHlcd .sijnilarily algorithms are intended to be exempiary and 

30 not limiting. It will be appreeiaied that similarity can be determined across the fidi length of 
the initial character strings or it can be restricted to particular subdomains. 
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5. FrM»<?R c y-biased selfcctio n. 

In frequency-ljiased subsequence selection methods, the subsequences are 
selected sucti that the endpoints of ti^e subsequence(s) occur in a particular relationship to 
subsequctjce domains that meet a particular presdecied frequency criteriont. For exampie, 
5 where it is desired to exclude encoded biological molecules that contain highly repetitive 
subuait patterns {e.g. m the case of a nucleic acid, a high concentration of AC repeats such as 
"ACACACACACAC*'), the subunit selectioa can be designed to create an endpoint prior to 
the occurrence of a particular repeat density of tlie particular subunit or motif of subunits. In 
this instant a repeat density is the number of occurrences of a subunit or subunit motif per 
1 0 character stiing length measured in subunit number or lengths of the subunit motif 
respectively. 

Thus, in the exampie suggested above, the substring can he seiectcd such that 
the substring endpoint occurs adjacent to a character string dojiiain in which the AC motif 
occurs at a frequency over 0,5 (50%) over a length of at least e.g. 4 motif lengths (in this 

1 5 case 8 subuni t lengths) . 

An other example, of such a selection is a substring selection based on the 
occuirence of a particular subunit at an occurrence of 100% over at least X stibunits. Thus, 
for example, where the encoded biological molecule Is a nucleic acid and the subimit is 
adenosine "A", the frequency-biased selection may set a substring endpoint at the occurrence 

20 of a polyadenylation signal (e.g., AAAAAAA). Depending on the design of tihe irequency- 
biascd substring selection criterion, equivalent results may be obtained using a motif-based 
selection scheme as described above. 

6. Other,crlten ^ ^^ 

Numerous other criteria can be used to influence and/or determine the 
25 selection of particular substrings. Such criteria include the predicted hybrophobicity and/or 
PI and/or PK of the moiecisie encoded by the substring. Other criteria include the cross-over 
number the desired fragment size, the substring length distribution, and/or rational 
inforination regarding the folding pf the molectile(s) encoded by the s«bstring(s). 

IV. Coneateaaiin^ substriuss. 
30 Once populations of substiings are selected fi-ojn the initial strings, the 

substrings are concatenated to produce new strings of approximately or exactly the same 
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length as the parent initial strings. The stnng concatenation can be pcrfbnned according to a 
wide number of methods. 

In OHi: embodiment, ;hL -u.—'j nus ^ c o,,m^y i^oncaicnatKii to produce 
"recombmed" stnnrs In one appro.irh o u c i "-a «Jun\" u->iKatcr.auun. each subsuing is. 
5 asiiigned a unique idontificr < an intege; 07 oihet idcmifici). T he ideutifier-i arc then 
randomly selected frorn the pool {e.g. using a random number generator) and the 
subsequences corresponding to those identifiers arc joined to produce a concatenated 
sequence. When joined subsequences are approximately ore exactly tlie length of tlis 
starting character string(s), the process is started anew to produce another string. The 

1 0 process is repeated until all of the substrings are utilized. Alternatively the substrings can be 
selected without withdrawing thera from the "substring pool" and the process is repeated 
until a desired number of "fiiU-length" strings are obtained. 

In preferred embodiments, however, it is desired to maintain the relative order 
of substrings .forming the concatenated strings as existed in the initial strings. This can be 

15 accomplished by any of a wide number of means. For example, each substring selected from 
a parent sttmg can be "tagged" with an identifier (e.g. a pointer) identifying the position in 
tlie imtial siring of that substring relative to the position of the other substrings derived from 
that parent string, Substrings derived frotn corresponding positions in other initial strings are 
assigned similar positional identifiei-s. This approach is illustrated in Figure 2 where three 

20 initial strings (designated A, B, and C) each give rise to five substrings numbered 1 tlu-ough 
5. Each substring can be uniquely identified (e.g., AI, A2, . . . A5, Bl, B2, . . . B5, Ci, C2, . 
. . C5) as illustrated. A concatenated string can then be produced by randomly selecting a 
substring from pool 1 (consisting of Al, Bi , and C2), a substring from pool 2 (consisting of 
A2, B2, C2) and so on though pool 5. Tiiis process can be repeated until three strings are 

25 reconstructed. 

In thi-; concatenation , cheme, once a s^bstriny is concaienated it is removed 
trom the siibstnng pnui However, the cotKak-nat'or cr. be at^complishcd b> "^vpy tng" the 
iubfet'Otietive liotn the pool and thus utiiizmg il m a concalenaied sequence s.\hiic ^tdl 
refamir.g the :^ub^rnng aA ailability for subsequent concatenations. This permits greater 
30 diversity to be generated. 

In other embodiments, various alignment and/or similmty algorithms can be 
used to generally maintain the relative sequence of tlie substrings during the concatenation. 
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\n tins appioauh. .subsequences are u.-s ^^uc^' a -'c/iS \l p'isition in {he concatenated sequence 
by associiitmg rcgions ofhigh S!mi.<i:r; ^cc or l'gu"c3). 

In picrt-ired emboairnciN the "-ni-a! orv bMlotjicnl mnleoules bear some 
reiauunship u ith each oUifi Thus, ior eXfUiipje, where tht, encoded moiccuics represent 
5 members m a particular cn7>me famdy, molecules represent indj\ iduals from u paiuctilar 
popuiaiion, cic. The subsequences are expected to share domains of significani smiikriiy. 
In addition, critical functional domains wiil tend to be conserved and therefore also increase 
the simiiarity of particular domains of the subsequences. Thus, aligning regions of high 
similarity between subsequences will tend to reconstruct the relative order of the 

10 subsequences to reflect their order m the initial strings, 

it will not required that perfect order be estabhshed in ever>' concatenated 
character string. That a percentage (e.g. preferably at least 1 percent, more preferably at 
least 10 percent, still more preferably at least 20% and most preferably at least 40 percent, at 
least 60% or at least 80 percent) of the concatenated sequences preserve the original order is 

1 5 preferred. 

The use of simiiarity measures to rc-ofder the subsequences is similar to 
sequencing by hybridization (SBH) methodologies in wliich similarity algorithms are used to 
reconstruct nucleic acid sequences from fiagineuts of the complete sequence (.vee, e.g., 
Barinaga (1991) Scimce, 253: 1489; Bains (1992) Bw/Techmhg}> H): 757-758; i)rmanac 

20 and Crkvenjakov, Yugoslav Patent Application #570/87, 1987; Drmanac ei al ( 1 989) 
Genomics, 4: 1 14; Strezoska et al (1991) Proc, Natl. Acad. Sci. USA 88: 10089; and 
Drmanac and Crkvenjakov, U,S. Pat. No, 5,202,231). 

It will be appreciated that certain concatenations aione, or selection and 
concatenation operations together can be represented by particular operators. Certain 

25 operators of this kind arc known in genetics algoriLhms. 'Thiis. for example, a "crossing 
over*' (reciprocal translocation) operator can be defined in which subsequences at a similar 
position in two different initial sequences are excliangeci. Similarly ''linkage" operators can 
be defined that link particular subsequence» ;n cross-over evenis so that the subsequences 
crossover together (whether or not tiiey are adjacent subsequences), hi view of the foregoing 

30 discios\ire, other operators will be known to those of skill in the art. 
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V) Adding the produ ct strings a eoilecii on of strings. 

The concatenated strings produced by the methods of tliis mventioo are added 
to a collection of strings that forms the "populated dataset". The striags in tlus ix^llection can 
he used as initial strings in further iterations of the methods described herein (see. Figure 1). 
5 The addition, in this context refers to a process of identifying one or more strings as included 
within a set of strings. This can be accomplished by a variety of means including, but not 
limited to copying or moving the string(s) in question into a data structure that is a coliection 
of strings, setting or providing a pointer from the string to a data .strucfcure that represents a 
coliection of strings, setting a flag associated with the .string indicating its inclusion in a 
10 particular set, or simply designating a rule that the string(s) so produced are included in the 
collection. 

Once one or more concatenated character strings are generated, a selection 
criterion can optionally be imposed to determine whether or not tlie concatenated strings are 
to be included in the collection of strings [e.g. as initial strings for a second iteration and/w 
15 as elements of the populated datastt'iicmre). A wide number of selection criteria can be 
utiHzed. 

hi one embodiment, a siiniiarity index can be used as a selection criterion. 

Tiius newiy generated concatenated character strings must share a particular predefinined 

smiilarit>' (e.g. greater than 10%, preferably greater than 20% or 30%, more preferably 
20 greater than 40% or 50% and most preferably greater than 60%, 70%, 80%, or even 90%) 

with each other and/or with the initial strings {or the encoded molecules) and/or with a one or 

more "reference" strings. 

Selection can also involve the use of algoritlims that evaluate "relatedness" 

even when sequence identity is quite low. Such methods include "threading" a!gorithm.s 
25 and'or co\ at nnce measure^ 

Othe£ 5.ekLi.(.> c^ei a ljh require that the moleculeCs) represented by the 

comatcnated stiinL'S meet certain computationally predicted properties. Thus, for example 

si-icctior cntcna couia require a imnmuini or maximum molecular weight, a ccilain 

minimum or maximwn free energy in a particular buffer .*!ystem, a minimum or tnaximum 
30 contact surface with a particular target molecule or surface, a particular net charge in a 

certain butter system, a predicted PK, PI, binding avidity, particular secondary or tertiary 

form, eta 
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Still other selection catena can tequiie th<n the niokxukts) toprt-sejiied by the 
concatenated s.lrmgs meet certain entpincal phystcn!l> <Kfavct.l piopcrtie^ Thu.s. for 
example, selection criteria cc ul.- -tqujc I'l.-t nu-iccj,c icprtocnied by tbt concatenated 
strmg have a certain tcmperatute si-i^r-.. cvO < c! /miui-o activity, produec <i pojutjon rf.i 
5 pd!t,L.ularpH h^n e a parttoi.,.*- kvs- .1 > ^ pH iip' ma, u \e o ininHiutm i^i 

maximum solubiht> in a paituu.ar ^. K!,it sv^;s;in, biml a target molecule witli a mmjmuni 
0! tnaximiitn aftinitv, and so fuuh. The phvsicai dttermination of particular selection criteria 
lypicalb tequires tiiat the nioiccuic(s) represented by the concatenated stnng{s) be 
synthesized {e.g. clietiiicaUy or by recombinant methods) or isdated, 

10 The application of such selection eritem in physical systems is ktiown to 

those of skill in the art {see, e.g., Stemmer et aL, (1999) Tumor Targeting 4: 1-4; Ness et al 
(1999) Nature Biotecknohg}' 17: 893-896; Chang etaL (1999) Natwe Biotechnology 17: 
793-797; MinshuO and Stemmer (1999) Curreni Opmion in Chemical Biology 3: 284-290; 
Gimstians €■/ M (1999) Nature Biotechnohg}' 17: 259-264; Crameri e? ai (1998)M;jtere 

15 391 : 28S-29I; Cfameri et al, (1997) Namre Bioiecfmology 1 5; 436-438; Zhang et al (1997) 
Proa Nail. Acad. ScL, USA. 94: 4504-4509; Patten eta!. (1997) Curr. Opin, Biotech. 8: 724- 
733; Crameri et al{l 996) Nature Med. 2 : 1 00- 1 03 ; Crameri etal.n 996) Nature 
Biotechnohg}' 14: 315-319; Gates ei ai. (1996) J. Mol. Biol. 255;373-386; Steirmier (1996) 
Crameri and Stemmer (1995) BioTechnimes i 8: 194-195; US. Patents 5,605,793, 5,81 1,238 

20 5,830,721 , 5,834,252, 5,837,458, WO 95/22625, WO 97/0078, WO 97/35966, WO 

99/41402; WO 99/41383, WO 99/41369, WO 9941368, EP 0934999; EP 0932670; WO 
9923107; WO 9921979; WO 9831837; VYO 9827230, and W09S 13487). 

VI. lotroductioa of additional variation. 

In certain instances it is desired to introduce additional variation into the 
25 population . This is particulariy desired where repeated iterations of an evolutipMry 

algorithm usihg the initial population generated by the methods of this invention does not 
provide a solution to the modeled problem {e.g. no member meets a selection criterion). 

Many methods can be used to introduce variation into the string population 
generated according to the method.*? of this invention It ts noted tlut vai uitiotj eaii bt 
30 introduced into the initiai stnng(s) (input to the raetroe! > o; mio the concatenated nnngts j 
(output). Preferably such variation will be introduced prior to a selection step, (.vei iJi 
certain cases, variation may be introduced after selection (e.g. before a second iteration). 
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In oae approach, a stocfaastic operator is mtfoduced into the dgaritjim that 
randomly/haphazardly alters the one or more subunits comprising mi encoded molecule. It is 
noted that variation can be introduced into the imencoded molecule (which is then re- 
encoded into a character istring) and/or the variation can be introduced directly itito the 
5 encoded character string. The stoch;i?t5C operator typicaiiy invokes two seiectiori prxKcsses, 
Otic sck'ction process involves the determination of which subunit(s) to alter, while oiher 
seiccttoa process involves a seiection/determination of what rhc sabiifut(s) are to bs altered 
into. Both selection processes can be stochastic or alternativeiy on selection process or the 
other can be detenninant. Urns, for example, the selection of the subunit(s) to "mutate" can 
10 be random/haphazard, but the mutation can always be into the same new/feplacemem 

subunit. Alternatively, the particular subunits that are to be mutated can be pre-determined, 
but the selection of the mutated/resultant subunit can be random/haphazard. Still in another 
embodiment, both the selection of the subunit to mutate and the result of the mutation can be 
random^haphazai^. 

1 5 In preferred embodiments, the stochastic operator will also take as an input or 

parameter a "mutation frequency" that sets the average frequency of occurrence of a 
"mutation". Thus, for example, where the mutation frequency is set at 10%, the stochastic 
operator wiU only pcnnit a niulation in one out of 10 subimits comprising in the initial 
Strings, The rautatioa frequency can also be set as a range (e.g. 5%-10%, etc.). 

20 The "stochastic operator" need not be applied to every mitial string nor to 

every substring comprising an initial string. Thus, in certain embodiments, the action of the 
stochastic operator will be constrained to particular initial strings and^or to particular 
substrings (eg. domains) of one or more initial strings. 

Vi/Tiere both selection criteria of the stochastic operator are tixed, the operator 

25 is no longer stochastic, but rather iiitroduces a "directed mutation". Such an operator may 
direct that every subunit "A" that the operator encounters is changed into a subunit "B". The 
dnet-tcd tiuttduuti oper-ui-r <.dn bt.U tike amutattor ftctucr a paiameleLaUnbine jnpui 
Ai, det-cubed abo\ c, the rautas'tu Ir^o .e .is ^ , . mh i n ocr "cncount>.'cd" sjbunii^ 
tiiat the operatot actuaJ> <^ o i 

30 it will albo hi apprtvi a< d . > tl l -,vc!Ufln. opeiaior, as desctibed abo\c, can 

alter moK xhm a sms^le encoded ^ubunsi In certam embodiments, the operator alters 
multiple encoded subunits or even entire substrings/domains. 
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Variation can also be mtroduced by the use of ms>ertion or deletion operators. 
Insertjon or deletion operators are essentially vanants of "stochastjc mutation" operators. 
Instead of transforming one or more subunits, a ddetton operator rerao\ os one or itwre 
subimt^., while an insertJon opeiator msens one or more subiuiits .Agruj dcSotiofi ana 
5 Jr=Cil30n t'ptiattus: hd\c t\^o selectum prncc^^ses. One pioce^s tijat selects ihc Mte of the 

insertion or deletion <ind anotbei prut. ess that {,tiect*- the ^uc of the deletiou or the idcnistj of 
tlie insertion One or both iciection piocesses can be stochastic Vt'herc both selection 
processes are predetermined (non-stochastic) the nisertion or deletion opcratom aie directed 
insertion or directed deletion operators. As with the stochastic operator, the inseilion or 

l i) deletion operators can take a mutation frequency as a parameter/attribate/input. 

In another embodiment, variation can be increased by adding cite or more 
initial strings that are randomly or haphazardly generated and bear no necessaiy relationship 
to the initial strings derived from biological molecuie(s). The variation-introducing initial 
stringfs) can be produced as a srxictly random or haphazard string or, in cenain 

J 5 embodiments, the variation string{s) are produced according to cenain predetermined criteria 
(e.g. freqnency of occurrence of particular subunits, minimum and' or maximum degree of 
similarity with the encoded strings, ek:). The variation-introducing initial strings need not 
be full-length strmgs, but can also simply include oise or more substrings, it will be noted 
that strings or subslringy of this nature can be used to reduce variation as well Thus, where 

20 a paiticular molecular domain is "favbred" strings or substring(s) encoding this domain can 
be added to the population of initial strings. 

yiL Popniatiag a data structare^ 

in one embodiment, all tlie concatenated string(s) produced by the metitods of 
tiiis invention are used to populate a data strucmre and/or are used as inhial strings in another 

25 iteration of the methods described herein. In other embodiments, selection criteria are 

imposed as described above, and only concatenated strings meeting the selection criteria are 
used as initial strings and/or are usc*d to populate a data structure. The data structure can be 
populated with the concatenated representation of the encoded molecule(s) used in the 
above-described mampidations, or aheniativeiy, the concatenated strings can be partially 

30 deconvolved to reproduce as simpler encoded or direct representation of the encoded 
biological molecules and these deconvolved strings carj be used to populate the data 
structure. 
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In one embodiment, ihe data structure can be as simple as a piece t.)f paper 
hjvjng The concaietKUcd strings xvritteti out on it or a collection oi' cards each curd listing one 
or more of the eonei;te-^u:*c>! iTnr.gs hi a p: crerrcd cir.b. >dirncnt, the data striictiirt.' is 
cnjbodied in niedia ii-- roecharn\\tl cuio i\\nd ano'oi optical and-or qiiaiitnm and^ot 
5 magnetic and or electronic) that pennit manipulation of tlie data structure by an appixtpriately 
desigiied computer, in particularly preferred erabodimetitSs the data structure is formed in 
computermemory {e.g,, dynamic, static, read-only, eic.) and/or in optical, magnetic, or 
magneto-opticai storage media. 

The data structurej even in a computer accessible fonn, can simply provide a 

10 list of the concatenated strings. Ahematjvely, the data stmcture can be structured to preserve 
relationships between the various "entries". At a simple level this can entail maintaining a 
simple identity and/or order of entries. More sophisticated data strudnres are also available 
and may prtwide ancillary structures for indexing and/or sorting and/or maintaining 
relationships between one or more entries in the data stnictttre (e.g., concatenated strings). 

15 The data structure can additionally contain annotations regarding the entry (e.g. origin, type, 
physical properties, etc.% or Unks between an entry and an exiemai data source. Preferred 
data structures include, but are not limited to lists, linked lists, tables, hash tables and other 
indexes, fiat-file databases, relational databases, local or distributed computation systems. In 
particularly preferred embodiments, the data structure is a data file stored on conventionai 

20 (e.g. magnetic and/or optical) media or read into a computer memory. 

yill. Embodiment in a Fro^rrammed Dig ital A pparatus 

The invention may be embodied in a fixed media or tiBusmissible program 
component containing logic instructions and/or data that when loaded into an appropriately 
configured computing device cause that device to populate a data structure (e.g. generate a 

25 pooi/coUecdon of concatenated strings) according to the methods of tliis invention. 
Figure 4 shows digital device 700 that may be understood as a logical 
apparatus that can read instructions from media 717 and/or network port 719. Apparatus 700 
can thereafter use those instructions to direct a encoding of biological molecules 
manipulation of the encoded representationfs) of the molecules and population of a data 

30 structwe, One type of logical apparatus that may embody the invention is a computer system 
as illustiated in 700, containing CPU 707, optioital input devices 709 and 711, disk drives 
715 and optional monitor 705. Fixed media 71 7 may be used to program such a system and 
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could represent a disk-type optical or magnetic media or a memoiy, Cantinunscation port 
719 may also be used to program such a system md could represent any t>pe of 
communicaiion coimection. 

The mventior a iiu'^ ' » ..M^i^died vviilinj the tircuitr* of an appKt'ition 
5 spcL iOc initgraied lsi Lutt (ASIC) j piourajirTiabJe logic device (PLD) In such ? cast-, rne 
invention may be embodied m a computer understandable descriptor language which may be 
used to create an ASIC or PLD that operates as herein described. 

The invention also tnay be embodied within the circuitry or logic processes of 
other digital apparatus, such as cameras, displays, image editing equipment, etc. 

10 IX. Kmbodiment in a web site. 

The methods of this invention can be implemented in a localized or 
distributed computing environment. In a distributed environment, the methods may 
implemented on a single computer comprising multiple processors or on a muHtpUcity of 
computers. The computers can be linked, e.g. through a common bus, but more preferably 

15 the coraputer(s) are nodes on a network. The network can be a generalized or a dedicated 
local or wide-area network and, in certain preferred embodiments, the computers may be 
components of an intra-net or an intensct 

In a preferred internet embodiment, a client systei« typically executes a Web 
browser and is coupled to a server computer executing a Web server. The Web browser is 

20 typically a program such as IBM's Web Explorer, or NetScape or .VIosaic. The Web server is 
typically, but not necessarily, a program such as IBM's HTTP Daemon or other WWW 
daemon. The client computer is bi-directionally coupled with tlie server computer over a 
line or via a wireless system. In tarn, the server computer is bi-directionally coupled with a 
website (ser\'er hosting the website) providing access to sotbvare implementing the methods 

25 of this invention. 

A user of a clien t coimected to the Intranet or internet may cause the client to 
request resources that are part of the web sitc(s) hosttni? the nppljcation(s) providtny -au 
implementation of the methods of this invention Senf t p:o\j£rin{<:) then proLCs,!, the it\]iiest 
to reium the specified rcvourees (j"?<«umnig thcv aie :.;irentl\ a\uiiabie). .\ standard nammg 

30 convention has been adoptee, kaov, a; z l nu. n.' Re^oari-c Uxvitor ( "t^fRL* 'i This 
convention encompasses several types of location names, presently mcluding subclasses 
such as Hypertext Transpoit Protocol ("http"), File Transport Protocol ("ftp"), gopher, and 
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Wide Area Information Service ("WAIS"). When a resource is dowaloaded, it may include 
the URLs of additional resources. Thus, the user of the client can easily ieam of the existence 
of new resources that he or she had not speciilcaliy requested. 

The software itnpiementing the me;thod(s) of this invention can run locally on 
5 the server hosting the website in a true ci ient-.sen er architecture. Thus, ihe client coinputer 
posts requests to the host server which runs the requested process{es) localiy and then 
dowfnloads the resuits back to the client. Alternatively, the methods of this invention can be 
implemented in a "raulti-tier" format wherein a component of tiie method(s) are performed 
locally by the cUeilt. This catt be implemented by sofMare downloaded froiii the server oh 

1 0 request by the client (e.g. a Java application) or it eah be implemented by software 
"penaanently" installed on the client. 

In one embodiment the appUcation(s) implementing the methods of this 
invention are divided into firamcs. hx this paradigm, it is helpful to view an application not 
so much as a collectioh of features or functionai ity but, instead, as a coii ection of discrete 

1 5 frajnes or views. A typical application, for instance, generally includes a set of menu items , 
each of with invokes a particular frame -that is, a form which manifest certain functionality 
of the appHcation. With this perspective, an application is viewed not as a monolitliic body 
of code but as a collection of applets, or bundies of Innctionaiily, In this manner from within 
a browsjer, a user wotild select a Web p^ge liiik which wouldj in turn, inyoke a particuiar 

20 frame of the appiicatioti stibappltcation). Thus, for example, one or more frames may 
provide fimctionality for inputing and/or encoding biological molecule(s) into one or more 
character strings, while another frame provides tools for generating and/or increasing 
diversity of the encoded character string(s). 

In addition to expressing an application as a collection of frames, ati 

25 application is also expressed as a iocitmn on the intranet and \>i Inft-inet, a URL (Universal 
Resource Loc^tort addtess pointing tlic application. Each URL preferably mchides two 
chaiactcnstscs Cfjutent data lot the URL {• l a hatcvcr data is stored un ihv or) logi-iher 
v,iih a data t\pe oi MIMb ' vK /'r-. :'->os.^ i. ic^.o! Mail lixtension) t>pc 1 he d^ta npc 
;dknvs a \S ch hu)v\^tT to d^xtni v lKu^ st ' i jn..!r[0 .Cvu.Ned m a se;\er (t,',^' , 

30 such as uitcrprenng a .iiif file as a bitmap unagej. In cttect, this serves a.s a de:,cnrtion of 

what to do with tlie data once it is received at the browser, if a stream of binary data is 

received as type HTML, the browser renders it as an HTML page. If instead it is received 

type bitmap, on the otlier hand, the browser renders it as a bitmap image, and so forth, 
-32- 
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In Mici jM>ft Wmdows, dsffeicnt techniques exist loj alkiwing a ho.n 
applicdnon to K'gt^tei an tnierest m adata object {i c daid of a partKuiai ispt:) Ont 
technique .oi the apphoation jo 'ol' ^ o- 1- 'a n rr lu-csf iti .i pattjcubt ilk= 
exk'itiioii toj an trg doc- '^^t. tosolt v-^ os^ Do^ A ) *hs-> i-> ihc mo^r common 
5 tcdiniquc ernpl<>\fd b\ Window apphcauon^ Another appro leb emplo\cd ui MiOiO<;ofx 
Object L inking and Enjijedded (OLE), is the i3j.e of a tiass Globaily Unique Idontilici or 
GUID--a 16-byte identifier for indicating a particular server application to invoke (for 
hosting the docianent having the GUID). The dass ID is registered on a panicular machine 
as being connected to a particular DLL (Dynamic Link Library) or application server, 

1 0 In one embodiment of particular interest, a technique tbr associating a host 

application with a document is through a use of MIME types, MIME provides a 
standardized technique for packaging a document object. It includes a MIME header for 
indicating which application is appropriate for hosting the document, all contained in a 
format suitable for transmission across the Inteniel. 

15 In one preferred embodiment, the metliods of the present irsvention are 

implemented, in part, with the use of a MIME type specific to tlie use of the methods of this 
invention. The MIME type contains information necessary.' to create a docimient (e.g., 
Microsoft ActiveX Document) locally but, in addition, also includes information necessary 
to fmd and download the program code for rendering the view of the doctunent, if necessary. 

20 If the prograni code is already present locally, it need only be downloaded for purpose of 
updating the local copy. This detmes a new document type which includes information 
supporting downloadable program code for rendering a view of the document. 

The MIME type may be associated with a file extension of .Al*t\ A file with 
the .APP extension is m OLE Document, tmplemented by an OLE DocObject. Because the 

25 .APP file is a file, it can be placed on a server and Unl^ed to using an HTML HREF, The 
.APP file preferably contains the following pieces of data: (1) the CLSID of an ActiveX 
object, which is an OLE Document Viewer implemented as one or more forms appropriate to 
the use of the methods ofdiis invention; (2) the URL of the codebase where the object's code 
can be found, and (3) (optionally) a requested version number. Once the .APP DocObject 

30 handler code is installed attd register.s tific APP MIME type, it can be used to download an 
.APP file into the user's Web browser. 

On the sender side, since the .APP file is really a file, the Web server simply 
receives the request and returns the file to the chent. When the APP file is downloaded, the 
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APP DutObiei.t bajkiJci asK ihi. r f ^..m < do ' plmd fho co k*oasc foi the obtctl 
speciiscd in The \PlM5le lh'ss\sc>-it vtc^ n\ ^' 1 ^ e W mdows through the 
CuOHCiassObjcctJ-iomUTr .1 \i n f . UiU.X -) eu ^ . kL-^v -duAnkwdea, 

the APP DocUb}e;.lhdndki a^k^ a 0 bro ^.ser ^ j ,o\k f.i . d., rot in<suni.c b\ 
5 ealhng ihe Acu vsUcMe method on the Expio or document uxc [ he Intmiot i \piorci then 
calls the DoeOojcct bacK to msfat tiate a vje\t, whxh Jt does by cieaung an mi*tan;.e ot the 
ActiveX view object from the code that was downloaded. Oiiee created, the ActiveX view 
object gets in-piace activated in the Internet Explorer, which creates the appropriate fonn and 
ali its child controls. 

10 Once the fomi is created, it can establish connections back to any remote 

server objects it needs to perfonn its functions. At this point, the user can interact with the 
form, which will appear embedded in the Internet Explorer frame. Wlieu the user changes to 
a different page, the browser assumes responsibiiit>' for eveniually closing and destroying the 
forni (and relinquishing any outstanding connections 10 the remote servers). 

15 In one preferred embodiment, from an end-user's desktop, the entry point to 

the system is tlie cor{>oraie home or the home page of another pattieuiar web-stie. The page 
can, optionally, include, in a conventional mamter, a nuittber of links. In response to the user 
clicking on a particular link to an application page (e.g. a page providing the functionality of 
the methods of this inventioii), the web browser connects to the application page (file) 

20 residing on the server. 

In one embodiment, where the user requests access to the methods of this 
invention, the user is directed to a particular page type, e.g., an application (appdoc) page for 
in-piace execution of an application (implementing one or more elements of tlie methods of 
thib mvention) m the Web browser Smce each apphcation page is located u<5ing an URL, 

25 other page-i can ha\e h\p<.i < M i) p,^i ■»age'=; m be gtouptd tf^geiht.t by 

maknigatataKV pdjc thnt vin*' -K'^ >u -'■^iiv.attoo p iges Wiien tht uso 

belecis a h\peihnk that posnli to to an app! l u or p ge tne Web browser downloads the 
appiRdtitm code and executes tlic page in'^iUe "-nc eiOs\<;ej 

Upon the biOs\s(.i du^^^^oad^^J thi. inpln-ilif n paje the bto sset (based on 

30 the dehned MIME type) invokes a local handier, a handler for documents of a type, ore 
particularly, the application page preferably includes a Globally Uniqiie Identifier (GUID) 
and a codebase URL for identifying a remote (downloadable) application to invoke for 
hosting the document, Gi%'en the document object and the GUID which arrive with the 
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application page, the local .handler looks lo the clieat inachine to see if the hostuig 
application already resides locally (e.g., bv exaiiiiuing Windows 95/^7 registry). At this 
point the local handler can choose to invoke a local copy (if any) or dowjiioad the latest 
version of the host application. 
5 Different models of dow'nloading code are con»iionl y available. When code 

is dowjiioaded, a "code base" specification (file) is initially requested from t}.ie server. The 
code base itself can range from a simple DLL file to a Cabinet file (Microsoft .cab file) 
containing multiple compressed files. Stili further, an information (e.g., Microsoft .inf) file 
can be employed for instructing the client system how to install the downloaded application. 

1 0 These mechanisjTJS afford great flexibility in choosing which component of an appiicadon 
gets downloaded and when. 

For preferred embodiments, the raachinery employed for actHaily 
downloading program code itself reiics on standard Microsofs AcHveX API (Application 
.Programiiiing ljiteriace)-calis. Although {he ActiveX .'\PI does not provide native support 

15 for Web-delivered applications, its API can be invoked for locating the correct version of the 
program code, copying it to the local machine, verifying its integrity, and registering it with 
the clients operating system. Once the code has been downloaded, the handler can proceed 
to invoke the now-present application host for rendering the document object (in a manner 
similar to invoking the hosting application through the registry if it were already installed). 

20 Now that the hosting application (OLE server) is loaded at the chent, the 

client system can employ the OLE document view architecture to render the application 
correctly witinn tlie browser, including using conventional OLE methodology for adding the 
application's menu to that of the browser and for correctly re-sizing the application upon a 
le si2«. of the browser ! as oppose to reuuir'n^, the applii.<ition to execute withm a smglc 

25 Active \ iontrul jectanjk ' ^ hi i. mss % noted) Onto tiie appbt-^tjon is 

exccuijii„ at tie cheat, u cat, t-\c(.ute lemote iogic such as usinsj RPC (Remote Pioccdure 
Call) meihodologv. In this manner logic which is preferably implemented as remote 
procedure(s) can siili ne used. 

In particular preferred embodiments, the methods of this invention are 

30 implemented as one or more ftaines providing the following functionality. Function(s) to 
encode two or more a biological molecules into character strings to provide a collection of 
two or more difTerent initial character strings xvherein e 

comprises at least about 10 subunits; functions to select at least two substrings from the 
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character strings; functions to concatenate the sisbstringv; to form one or oiore product firings 
about the same length as one or more of (he initial char:ictcr strings; and tunctious to add 
(place) the prochict strings to u ooiiccnon ol -iti inas. 

The futictions. to eacodc- two or inorc bi<.>;uyical molccisies preferLLbiy provide 
5 one or rnoic ^\ indows wherein xha user can insert repre.sentation{s) of bioiogical molecules. 
In addition, the encoding function also, optioaaliy, provides access to private and/or pubiic 
databases accessible through a local network a«d'br the intranet whereby one or more 
sequences contained in the databases can be input iuto the methods of this invention. Thus, 
for example, in oiie embodiment, where the end user inputs a nucleic acid sequenced into the 

10 eiKodtng flinction, the user can, optionally, have the ability to request a seai-ch of GenBank 
and input one or more of the sequences returned by such a search into the encoding and/or 
diversity generating function. 

Methods of implementing Intranet and'or Intranet embodiments of 
computational and-'or data access processes are well known to those of skiJi in the art and are 

15 documentede in great detail (see, e.g., Cluer ei qL { 1992) A General Framework for the 
Optimisation of Object-Orienled Queries, Proc SIGMOD Jntema/iom! Conference on 
Managemem of Data, San. Diego, California, Jun, 2-5, 1991. SIGMOD Record, vol 21, 
Issue 2, Jun.. 1992; Stonebraker, M., Editor; ACM Press, pp. 383-392; ISO-ANSI, Working 
Drafl, "Information Technology-Database Language SQL", Jim Melton, Editor, International 

20 Organization for Standardization and Americaii National Standards Institute, Jul, 1992; 

Microsoft Corporation, "ODBC 2.0 Programmer's Reference and SDK Guide. The Microsoft 
Open Database Standard for Microsoft Windows.TM. and Windows NT.TM,, Microsoft 
Open Database Connectivity.TM, Software Development Kit", 1992, 1993, 1994 Microsoft 
Press, pp. 3-30 and 41-56; ISO Working Draft, "Database Language SQL-Part 2:Foundation 

25 (SQL/Foundation)", CD90"5-2 199 thi SQl Sep I j , 19Q" and tne hkc) 

Thosc skilled m the art wxW ^cc^'g^'/^. man\ inodsfK itioii^ nus be nude to 
this configuiation wnhout dtpirtuig tion"" the ^cupt of the present Jinentiot) I <!> e^aniplc, ni 
a t\o-ticr configuKUton the ser\cr >.vsrefu ^wa.m.: ih. tunction^ ot the W \\U pdlusj 
nw ai->o Csccuiv' the tunctjons ot the \\ eb sentx For example air' one oi tht .ibo\e 

30 descnbod cmbodmicnts could Dc moduicd to accept rcaucst^^ fiom usets usct lerminaK that 
are in a format other than a URL, Yet another modification would involve the adaptation to 
a multi-manager environment. 
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X> , tncvtrporadita a pin steal i\atuarioit and iVedback loop. 

Aj. indaatcd ab >^ . , vv \r" ptc;onca Jinbodinienis, ihc .seloction cntcna 
can require thai the moteciiloi'^l re.is j-c-^jteu ^ ire ooi cucnatcd sttmgs meci eeitani 
empirical ph>dicalK as'ia^^ed propeuk>->. To assay these properties it ts ncecssacy to obtain 
5 the encoded moiccuic^. i o accomplish this, the molcrtde(s) rcprf rented b> the 
concatenated string(s) are physicaUy synthesized (&g. chemically or by recombmanl 
methods) or isolated, 

Physicai synthesis of genes, prateins, polysaccharides encoded by the 
coUcction(s) of character strings produced according to the present invention is the primary 

10 means to create a physical representation of matter thai is amenable to a physical aSvSay for 
one or more desired properties. 

in a preferred embodiment, gene SNiithesis technology is used, typically, to 
construct libraries in a consistent manner and in close adherence to the sequence 
representations provided m tlie collection of concatenated strings produced by the methods 

15 of this invention. 

Pj eferred gene synthesis methods allow fast constntction of libraries of 10''- 
1 0'^ "gene/protein" variation. This is typically adequate for screening/seiection protocols as 
larger libraries are more difficult to maice and maintain and sometimes cannot he as 
completely sampled by a physical assay or selection methods. For example, existing 

20 physical assay methods in the art (including, e.g. . "Ufe-and-death" selection methods) 

generally allow sampling of about 10^ variations or less by a particular screen of a particular 
library, and many assay are limited to sampling about 10'*- 10^ members. Thus, building 
several smaller libraries is a preferred method as large libraries cannot easily be completely 
sampled. Larger libiuries, however, can also be made and sampled, e.g.. using high- 

25 throughput roethod.s. 

There arc many methods which can be used to synthesize genes, 
polysaccharides, proteins, etc. with well-defined sequences and the area is quickly 
developing. Solely, for the purpose of clarity of illustration, this discussion will focus on one 
of the many possible and available types of known methods for the production of biological 

30 molecules. 

Current art in the polj'nncleotide synthesis is best represented by well-loiown 
and mature phosphoramidite Ghemistry which allows one of sMH to eftectivcly prepare 
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oligonucleotides. It is possible, but somewhat impractical fo use this cheniihitry for routine 
synthesis of oligonucleotides significantly longer than 100 bp and the synfhcuc yiekl 
decreases and the degree or piiriiication required increases. Oligonucleotides of a "typical" 
40-SO bp si^e can be obtained routinely and directly with \'ery high purily. 
5 h is noted that oligonucleotides and eveti complete syiuhetie (double slianded 

or single stratided) genes cati be ordered from any of a number of commercial sources such 
as The Midland Certified Reagent Company (incrc@oiigos.com). The Great American Qcnc 
Company (http://w%'w.genco.coni), ExpressGen, Inc. (www.expressgenxom) Operon 
Technologies Inc. (alameda, CA), and niany others. Similarly, peptides can be custom 

10 ordered from any of a variety of sources such as PeptidoGenic {pkiin@ccnet,coni), HTI Bio- 
pro=dticts, Inc. (http:// wwvv.htibio.com), BMA Biomedicals, Ltd. (U.K., Bio-Synthesis, 
Inc., and raany others. 

A relevant demonstration of toiai gene, isynthesis from small fi-agments which 
is readily amendable to optimization, parallelism, and high throughpnt iss set forth by Dillon 

15 and Rosen (1990) Biotechniques, 9(3): 298-300, A simple and rapid PCR-baaed assembly 
process of a gene from a sei, of partially overlapping single-strand oligonucleotides without 
the use of Ugase is described. Several groups have also described successful application of 
variations of the same PCR-based gene assembly approach to the synthesis of various genes 
of increasing size, thus demonstrating the methods general applicability and combinatorial 

20 nature for synthesis of libraries of mutated genes (for useful references see also, Sandhu ef 
at. (1992) Biotechniqites, 12(1); 15-16, Prodomou and Pearl (1992) Proiein Engin., 5(8): 
827-829, Chea et al. (1 994) JACS, 1 1 94(1 1): 8799-8S00, Hayashi et al (1 994) 
Biotechniques^ 17: 310-314, and others). 

Mare recentiy Stenimer ef al (1995) Gme 1645; 49-53, pnavided evidence 

25 that PCR-based assembly methods are useiitl to build larger genes of up to at least 2.7 kb 
from dozens or even hundreds of synthetic 40 bp oligonucleotides. These authors also 
demonstrated that, from the four steps comprising the known PCR-ba.scd gene synthesis 
protocol (oligonucleotide syntljesis, gene assembly, gene arapiificationj and typically, 
cloning) the gene amplitication step can be omitted if a "circular "assembly PCR is used. 

30 Once prepared, the gene(s) can be inserted into vectors and the vectoi^ used to 

transfect host ceils and express the encoded pmtein(s) according to routine methods well 

known to tliose of skill in the art. Cloning methodologies to accomplish these ends, and 

sequencing methods to verify the sequence of nucleic acids are well known in the art. 
-38- 
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Examples of appropriate cionrag and sequencing techniqises, and instructions sufficient to 
dnoct pcisotii, of ski?! through many cloning cxckj-,.^ . ro *ouid jn Bcrgci ajid ivHTimcl, 
GuiJ< to MvUi^dlai i. iohiv^ Tecrt „•,< \ ,n' i / 'to'i'--, \ vil \'^2 Au-Jiicnuc Prt.ss 
it ic , S.H Diego C \ (Bcigci). Sairbfo.JK t> Mo'i.'i.iitj> Ctonnig _A Lahinaton 

5 Mamiol diJ et/ ) \'ol iO. Cold iJprtng Harbor I iboratorv Cold bprmg Harboi Piesb, \Y, 
and tun ent Protocols /« Molecular Biolog\\ F.M, Ausubel et al , eds., Current Protocols, a 
joint venture berwcea Greene Publishing Associates, Inc. and John Wiiey & Sons, Inc., 
(1994 Suppiement). Product infoniiation from manufacturers of biological reagents and 
expenniental equipment also provide information useful in known biological methods. Such 

1 0 manufacturers include the SIGMA chemical company (Saint Louis, MO), R&D systems 
(Minneapolis, MN), Pharmacia LKB Biotechnology (Piscataway, NJ), CLONTECH 
Laboratories, Inc. (Palo AUo, CA), Chein Genes Corp., Aldrich Chemical Company 
(Milwaukee, WI), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersberg, 
MD), Fkika Chemica Biochemika Analytika (Fluka Cheniie AG, Buchs, Switeeriand), 

15 Invitrogen, San Diego, CA, and Applied Biosysiems (Foster City, CA), as well as many 
other commercial sotirce.s kt^owti to one of skiil. 

The physical molecules, once expressed can be screened for one or more 
properties and it can be determined whether or not they meet the selection criteria. The 
character strings encoding molecules meeting the physical selection criteria are then selected 

20 as described above. Numerous assays for physical properties (e.g. binding specificity and/or 
avidity, enzymatic activity, molecular weight, charge, tliermal stability, tempemture optima, 
pH optiiTia, etc.) are well known to those of skill in the art. 

In certain embodiments, the physical molecules can be subject to one or more 
"shuffling" procedures and optionally screened for particular physical properties, to generate 

25 new molecules which can then be encoded and processed according to tlte methods described 
above, 

A variety of "shuffling methods'' are known, including those taught by the 
inventors and their coworkers, e.g. Stemmer, el al. (1994) Nature 370: 389-391, Stemmer 
(1994) Proc. Natl Acad. ScL, USA. 91; 10747-10751, Stemmer, U.S. Patent Ho; 5,603,793, 
30 Stemmer et al. U.S. Patent No; 5,830,72 1 , Stemmer et aLU.S Patent No: 5,8 1 1,238. 

Minshuil el al U.S. Patent No; 5,837,458, Crameri et al. (1996) Nature Med., 2(1): 100-103, 
PCT Publications WO 95/22625, WO 97,^007S, WO 96/33207, WO 97/33957, WO 
98/27230, WO 97/35966, WO 98/3 1837, WO 98/13487, WO 98/13485 and WO 98/42832. 
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in aauuion. sc\ il '^oprnd ng <»pphciUou!> dcbiriibc miportaju \ "-HufPmjj; .ncUiodologies 
' hcc c i; ^opendn<4 I SS\ 1 jied Tulv 15, 1998, USSN 60/102 and 

Sehlonos and Stem ^ V ^ , < i sii ih^\ o f ui leoiKn \ A. 

pr>' n< pt' L^. f(n>y><^J^^'. >. > , . , 05 I'-W USSN f^O 118^^4) 

5 In .iddttion, the tlf^ethod^ de^cnb.d abo\e can also be piact{i,td m a p jralkl 

mode svhere each of the individual library members, including a pluraiity of the genes, 
proteins, polysaccharides, etc. for subsequent physical screening are synthesized in spatially 
segregated vessels or arrays of vessels, or in a pooiwise manner where all, or part, of the 
desired pjarality of moleputes; are synU;iesi;?ed in a single vessel. Many other synthetic 

1 0 approaches are known and specific advantages of one versus another may readily be 
determined by one skilled in the art. 

The processes discussed herein are amenable to production using high- 
ttuwighpiit systems. High throughput {e,g. robotic) systems are commercially available (see, 
e.g., ZymarkCon^., Hopkinton, MA; Air Technical Industries, Meotor, OH; Beckmao 

15 Instruments, Inc. Fullerton, CA; Precision Systems, Inc., Nalick, MA, etc.). These systems 
typically automate entire procedures including all .sample and reagent pipetting, liquid 
dispensing, tinted incubations, and ilnal readings of the microplate in detector(s) appropriate 
for the assay. These configuarable systems provide high throughput and rapid start up as 
well as a high degree of flexibility and customization. The manufacturers of such .systems 

20 provide detailed protocols the various high throughput. Thus, for example, Zymark Corp, 
provides technical bulletins describing the use of high throughput systems for cloning 
expression aitd screening of chemically or recombinant] y produced products. 

Mi Uses of the generated string poputationfs). 

25 In one embodiment the methods of this invention provide a poptilation of 

character strings. Particularly preferred character strings represent encoded biological 
molecules and typically tlie encoded molecules bear some relationship to each other 
reflecting a level of biological organisation. Consequently, the character strings produced by 
die methods of this invention do not reflect a raiidom or haphazard selection fmm a imifonn 

30 sequence space, but rather captui-e degrees of relatednsss (or variation) reflective of that 

partieuiar level of organization (e.g. gene, gene family, individual stibpopulation, etc.) found 
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m tiie natural world. The coJtections of cliaracter strings (e.g, populated data structure) 
produced by the methods of this invention thus provide a useM starting point for various 
tn ohiiionar}' iiKHiels and are convenient for use in evolutionary algorithms (evolutionar)' 

cotiipiiting). 

Whcti ust'J it^ Mtci) nuxlcls, the populations (collections of character strings) 
produced by tiie nieihods of this iirveiition provide far more infomiation than evolutionary 
algorithms run oa arbitrary popuialions. 

For exarapie, where an evoiutionar>' algorithm utilizes as a starting point, a 
population comprising a set of tandoirn or arbii^aiy meittbefs, the dytiamks of the simulation 
reflect progression from the arbitrary starting point to a particular solution {e.g. distribution 
of properties in the resulting popalation(s)). Since the starting point is arbitrary and 
essentially unrelated to a population produced by a natural process, these dynainics afford no 
information regarding the dynamics of natui-al processes/populations. 

In conttast, the collections of character strings produces by the methods of 
this invention contain far more information tiian the randomly produced starting points used 
in conventional evolutionary algorithms. First, each member of population coinains 
considerable jnibnnation regarding molecular structure. Thus, one member is distinguished 
from anodier aiember not simply as "self7not-self ' (ie. an allelic representation), but rather 
members are distinguished by degrees of relatedness/simiiaritj'. Members of the populations 
produced by the methods of this invention will reflect varying degrees of covariation. 

In addition, because tlie populations produced by the methods of this 
invention i-eflect a fine structure characteristic of the level of biological organization encoded 
into the initial strings, the initial dynamics of a simulation run using these starting sets 
reflects the dynamics of "real world" populations and aSbrds considerable insight into 
evolutionary processes. 

In addition, because specific molecules are represented by members generated 
using the methods of this invention, evoluiionasy algoritiuns run using these data stmctures 
provides real inlbrmation regarding molecular evolution andy'or the design of new and useful 
molecular entities, 

B) Use m index genera tton. 

In another embodiment, the data structures generated by the methods of this 
invention can be used as tags (indice.s) for indexing essentially any kind of infomiation, IN 
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this approach, inionnation ot f-ir.ucr sirrl?rity is tagged using members of the JtUu simctuic 
(charactet sti ings) h iru guraic vi'mJ:: u ^ i}e mformalion of lower ^-tmilii u\ tf ta(:i:ed 
With membeis ot the u M jctM j /i^ a .\ - -v^.-^r In pjctcned 5.'mhodunt;nb, the 
simiianty ot the cluua^tof >t./ „ - 1 -<\ v t.^j o -c oiiv picoc^ oi aat.^ rt,t1ctt^ {is 
5 proportional to) the t.)nij'aiit> t .J'o:^^^ infoTr-tion 

When a ^cMch is performed. <ui jniudj lui is jucnuiiaj umui: tratiuiona! seatch 
tcchiiKjue^ Then, when cio^eiv related infotmation is desired, ihe datu :>tnicture cati 
seai'ched for similar members usiug aB> of the well known siimlarUy algorUhrns described 
above. These similarity aigorithms are designed to provide a thorough, rapid, and efficient 
1 0 search of an enormous data space. Whm meiiibers (indices) of desiml similarity are 
identified, they will point to the tagged data thereby providing the end user with related 
information. 

C> Use as reference objects in data base seare^^^ 

In a related application, the data structures produced by the methods of this 

1 5 invention, or the memhers of s«ch data striiclures (i^e„ the character stjf ings) ean be used as 
reference objects in database searches. For example, initial known informatioi^ (e.g. 
molecular structiire, or index strings from a knowledge database as described above) is 
encoded and modified according to the methods described herein. This produces a «ew data 
stmcture that captures related, l)ut non-obvious variants of the initial encoded information. 

20 The resulting information (e.g., members of ti)e data stiucture) can be 

deconvolved to identify actual or theoretical niolecu!e(s) and this can be used lo searcii 
typical databases for the same or reiated molecule-s. Where the encoded infomiation is from 
a database index, tlie member of the data structure can be used to probe the original or new 
databa.se to identify relevant/related information, 

25 i» Identification, of strHctoral,jtno^^^^ 

projperiies. 

It is ofien of interest to identify regions of a niolecuie (eg, a protein) that may 
be responsible for specific properties, e,g. to facilitate fanctiona! manipulation. This is 
traditionally done using stntcturai information, usually obtained by x-ray crystallography. 
30 The sequences of naturally occurring enzymes that catalyze similar or even 

iderrtical reactions can var>' widely; sequences may be only 50% identical or less, while a 
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fijini!)- oi'^iicb viwine.^ rnvty catalj-ze oiie iderrtical reaction, othcj properties of these 
eii/viues may differ significantly. These mcKsv'e pi \?icai propcaics siicfi as ^tabiiuy to 
{emperaturc and otganic soivcnts, pi! ept-r-a, - iubj./'v, j-ibn *c ic^am j^ctivity when 
inu)tr>bili7od, case of expression in uinv^-cni hosS siv^tcm-^ "ihc^ also inciudo cyiaijtic 
5 pn.prrties mciuding actJMiy (fC,t and K.,„), the range ofsubstraies accepted, and even of 
(.lieini uio pftfoiJHfd, THF. rneihods described here can also be applied to non-catalvtic 
prtitciiis (eg liuands such as cx'tokioes) and even nucleic acid sequences (such as promoters 
that may be mdueible by a number of diffcixnt iigands), wherever multiple fimctionai 
diitsensions are encoded by a famiiy of "homoiogous "sequences. 

10 Because of the divergence between enzymes with similar catalytic fimctions, 

it is not usually possible to correlate specific properties with individual amino acids at certain 
positions. There are just too many amino acid differences. However, libraries of variants 
can be prepared from family of homologous natural sequences by encoding members of the 
family irito initial strings according to the methods of this invention, then selecting and 

15 concatenating substrings to populate a data stmcture with encoded variants. 

The encoded or deconvolved variants can be tested in siiico for desired 
propeilies and/or the encoded variants can be deconvolved, and tlie conesponding molecule 
piiysicaily synthesized as described above. The synthesized molecule can then be screened 
for one or more desired properties. 

20 If members of fiie data structure are tested under a specific set of conditions 

for a particular property, the optimal combinations of sequences from the data structure (or 
the initial string coHection) for those conditions can be determined. If tlie assay conditions 
are altered in only one parameter, different individuals firom the library (data structure) will 
be identified as the best perfomicrs. Because the scrcersing conditions are very similar , most 

25 amino acids will probably be conserv^ed between the two sets of best perfonners (the best 
performers in (he initial string collection (set 1) and the best performers in the populated data 
structure (set 2)} Comparisons ol the ^equot Cv,-- ol t best cn/vmcs under the t\\ o dittccnt 
tondit'.ons iherehitt. tdontt v' the se(jucn(.t djtrejcnces lesponsipio uf t!i, (hOc.f'iKf^i ni 
periomiance, 

30 Prmcipie component analysis (e.g. ussng Partek type software ; is one ot many 

muUi-variate tools useful for such m analysis. 
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E) Use in Eenerating music, 

ill still anotlier embodiment, the methods of this mvention can be used to 
generate music . Using my of a number of weil known programs, bioiogical molecules (e.g. 
DNA, pR>teins, efa ) can be encoded into romical notes. This cm involve mapping a 
5 particular subunit onto a particoiar note. The timing and/or timbre of the notes is determined 
by the motif and/or secondary structure in which the subunit occtirs, 

Tlius for example, the progi-ani SS-midi has been used to encode various 
nucleic acid and amino acid sequences into music. In one approach (DNA calypso, purines 
were played 3/2 the speed of pyrimidines, the bases C,T,G>A were mapped to the notes 
10 C,F,G A and the first strand was piaycd with j&2z organ, while the complementary strand 
with bass. In other approaches note duration can be longer when the note/si5bunit is found in 
a helix then when it is found in a p~sheet. Other vaiiarjts are, of course, possible. 

In the methods of this invention, the bioiogical molecules are encoded iiito 
strings, the substrings selected and concatenated and the data staicttjre populated as 
i.5 described above. The populated data structure is then used as input to a program (e.g., SS- 
midi) tliat maps the new sequences encoded in the data stnicturc into niussic. The data 
structure can be iteratively repopuiated as described above tltereby generating variants of tlie 
musical phrases ttius produced. 

y> V&e ia driving synthetic machinerv> 

20 As indicated above, the data .structures prtxiuced by the methods of this 

invention can be used to di ive devices for the chemical synthesis of the encoded molecules 
(e.g. pol>'peptides, nucleic acids, polysaccharides, etc.). Using only a few initial sequences 
("seed members") tlie mefliods of this thyention provide iiteraUy teas, hundreds, Ihousarids, 
tens of tlioiisancis, hundreds of thousands, or even mltlions of different encoded molecules. 

2,5 W\im the resulting data structure, or members hereof, is used to drive a chemical {or 

recombinant) synthesis a "combinatorial" library of the desired molecules of virtually any 
size can be prepared. Such "combinatorial" iibranes are widely desired to provide ,^ystems 
for screening for therapeutics, industrial process molecules, particular enzymes, etc. 

30 EX.4MPLES 
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Tlie foHowing examples ss-e offered to tllusti^te, btit not to limit the claimed 

invention. 

Examok I: Subtllisin family modet. 

Amino acid sequences were aligned. (Codon usage can be optimized on 
5 jcfrotrau'^latjon lor a pjcicnod o\prc«.Mon sv stem, and immber of ohgonucleotidcs for 

byuthcMs j Ik rautuut/cui \ IX>' p ot p. rv\ >,o aujn i\^u{ ol dl po-isibk pdii-i of "7 p^Jiciiti 
was made (1 ig 5, 7) Pan (i and 7 t,hov>ed 95 "o peiceni tdcnuiv pe' each \%stidow of '7<rra, 
whiitf aU other pans vsbv,nNei.iSO% percent identity per oach\vnido\N \ . \o!e dwi 
stringency oi dijgnmcnt {and >>ubs»equent representation ofcrosso%ct h(.u\ccn paionts; can be 
iO mjmipulated individually foi each pair, so that low homology crossover can be repjes>eirted at 
the expense of highly homologous parents. No structural biases or active site biases were 
incorporated in this mode!, 

■ExajnH„j[iM;2; A„,|)r6cm;tQ 

ch mierkal , i^toIynucIcoUde^^ : 

15 Firsit, substrings are identified and selected in parental (initial) strings for 

applying a crossover operator to from chimeric junctions. This is performed by: a) 
identifying all or part of the pairwise homology regions between all parental character 
springs, b) selecting all or part of the identified pairwise homology regions for indexing at 
least one crossover point within each of the selected pairwise homology regions, c) selecting 

20 one or more of the pairwise non-homology regions for indexing at least one crossover point 
within each of the selected pairwise nonhomology regions ("c" is an optional step which can 
be omitted, and is also a step where structure-activity based eiitism can be applied), thereby 
providing a description of a set of positionaily and parent-indexed regions/areas (substrings) 
of parental character strings suitable for further selection of crossover points. 

25 Secondly, further selection of crossover points within each of the jsufastrings 

of the set of the substrings selected in Part 1 is performed. The steps include: a) randomly 
selectiiig at least one of tlie crossover points in each of the selected subsmngs, and-'or b) 
selecting at least one of the crossover points in each of the selected substrings, using one or 
more of aimealing sitnulation-based models for determining probability of the cros.50ver 

30 point selection within each of the selected substrings and/or c) selecting one crossover point 
approximately in the middle of each of the selected substrings, tliereby creating a set of 
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pa)rw3se ciossovet points, wheic each point is indexed lo coirespondmg character posjtio«N 
IS each of the parenial stniiKs desired to from a chimeric junction at that pomi. 

ihirJj op '0 UL- u L i| iienls aic pcrlomiet^ Depending on 
meinot-i^ use 1 1 1 det(.im!no nunc^ im'- ^-vraJn^DNXo A \) th.. pn^i. t.->s wan b^. 
varied f ex-ifnp'c if i ONA ; ' si i u i-U it u>+ toJon^ ior ii e seit<. tn 

expi cswon s^sti.m 1!, peitoifmd iui cvcis pasv^ 1^.1 ^ j.Abf Kiju'-tiheiit oj c )d(- mumg 
parents can be peifnnneJ to Mctndaidi/e cuuot usd^ic for e\ors gnen ammo acid at e\t.iv 
corresponding position J hit. procc^b can significantly dcci case total number ot distinct 
oligonucleotides for gene bbraiy synthesis, and may be particularly beneficial for eases 

i 0 Where AA homology is higher tiiaii DNA homology, or with fatniUes of highly homologous 
genes (e.g. 80%+ identical). 

This option has to be exercised with caution, as it is in essence an expression 
of an elitism mutation operator. Thus, one considers the benefits of cutting the costs of 
oligpimcJeotides versus introdactiofl of this bias, which can have undesirable consequences, 

15 Most typically, one uses codoiis which encode AA at a given position 'in a majority of 
parents. 

If AA sequences are ussed: a) retrotranshue sequence to degenerate DN A; b) 
define degenerate nucieotides using pos!tjo!).-by-posti.ion refeiencing lo codon usage in 
original DMA (of majority of parents or of corresponding parent), and/or - exercise codon 
20 adjustrhenls suitable for the selected expression system where a physical assay will be 
performed. 

This step can also be used to introduce any restriction sites within coding 
parts of the genes, if any, for subsequent identification/QA/deconvolution/manipulations of 
libra;r>f entries, AH crosspv# points identified in Part 2 (indexed: to pairs of parents) arc 
25 correspondingly indexed to tlie adjusted 0NA sequences. 

Fourth, oligonucleotide airangements are selected for a gene assembly 
scheme. This step includes severa! decision steps; 

Uniform 40-60 mer oligonycieotjdes are typjcally used (using longer 
oligonucleotides will result in decrease of the luiruber of oligonucieotidcs to build parents, 
30 but uses additional dedicated otigonucleotides for providing representaiion of closely 
positioned crossovers/mutations. 
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Select vvhethcr shorter or langer oligonucleotides am allowed (/.t='. , a Yes/No? 
decision). A "Yes" decision cuts the total number of oHgonucleotides for high iiomology 
genes of different lengths with gaps (deletion/insertion), especially for !-2aa). 

Select the overlap length (typically 1 5-20 baKcs, which can be symroetricai or 

5 asymmetricaL) 

Select whether degenerate oligonucleotides are allowed (Yes.'^'o?). Another 
potent cost cutting feature and also a powerful mtms to obtain additional sequence diversity. 
Partial degeneracy schemes and rnimmized degeneracy schemes are especially beneficial in 
bdlding, matagenic Hbmries, 

H) If software tools are used for tiiese operations, several Variatioiis of the 

parameter are run, to select maximum Hbmry complexity and minimal cost. Exercising 
complex assembly schemes using oligonucleotides of various length significantly 
complicates indexing processes and, subsequently, assembly of the library in positionally 
encoded paraflel or jsartial pooling formats. If this is done without sophisticated software, a 

15 simple and tmifotnr scheme (e.g. all oligohucleotides 40 bases long with 20 bases overlap) 
can be used. 

Fifth, ^'convenience sequences" are designed in front and in the back of the 
parent strings. Ideally, it is the same set which will be built in every library eniry at the end. 
These inchide any restriction sites, primer sequences for assembled product identiftcatious, 

20 BBS, leader peptides and other special or desirable features. In principle, the convenience 
sequerxes can be defined at a later stage, and at this stage, a "djmtmy" set of appropriate 
length can foe used, e.g. a substring from an easily recognizable forbidden letters. 

in Part 6 an indexed matrix of oligonucleotide strings for building every 
parentis created, according to the selected scheme. An index of every oligonucleotide 

25 includes: a parent identifier (parentID), indication of coding or complcmcinary chain, and 
position numbers. Crossover points are determined for indexed codinu striTig oi'swry paretu 
with head and tail convenience substrings, .\ conipiemeniary chain of every siring i^ 
generated. Evcrs coding string is selected accordmg to the selected assembly VCR scheme 
50 part 4 (e.g. in increments of 40 bp) Every complement string is split according to the 

30 same scheme (e.g. 40 bp with 20 bp shift). 

In part an indexed matrix of ohgonucleotided is created for every pairvvise 
crossover operation. First, all oligonucleotides which h.ave pairwise crossover markers arc 
determined. Secoiid, all sets of all oligonucleotides which have the same position atid same 
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pair of parents crossover markers (4 per crossover pomt) arc determined. Third, every set of 
4 oligonucleotide strings are taken whjch have been labeled with the sanic crossover inarkcf, 
and another derivative set of 4 chimeric ohgcrLr'eo'KL- -tnngs vMnipnsmi.' of chdiaLlers 
encoding 2 codi ig and 2 complement chains < c „ v .tl- J bp -h^l \s\ 40 2ft Co schenic;- 'iv^ 
5 niaue 2 ( odm^' strings .ire po<!smic. navit i. j t.M \ .r^. d j,u,.;tnLe e'ibstnnjj ot one p<irem 
Ini lowed b\ tiK bacKward en<i oi the sectmd p.uent aftei oiossovei point Comp.ftnetit 
Strings 0 also dcMgned ui the i.dma fd;,hion, dica*bv obtaining an mdvxed complete 
inventory of binngs, encodmg oligonucleotides suitable for gene library assembly b> VCR. 

This inventory ean furtlier be optionaiiy refined by detectmg all redundant 
10 oligonucleotides, counting them and deleting from inventory, accompanied by the 
introduction of the count value to the "abundance=«amooat" field in the index of each 
oligonucleotide siring. This may be a veiy beneficial step for reducing total number of 
oligonucleotides for library synthesis, particuiariy in the cases if parental sequences are 
Wghly hdmblogbus. 

15 Modificatidm can be made to the methods iuid msteriais as hereinbefore 

described without departing from the Spirit or scope of the invention as claimed, and the 
invention can be put to a number of differem uses, includmg: 

The use of an integrated system to generate shuffled nucleic, acids and/or to 
test shuffled Bueieic acids, inchsded in an iteralive process, 

20 Ail assay, kit or system utilizing a use of any one of the selection strategies, 

materials, components, methods or substrates hereinbefore described. Kits will optionally 
additionally comprise instructions for performing methods or assays, packaging materials, 
one or more containers which contain assay, device or system components, or the like. 

hi an additional aspect, the pa^sent invention provides kits embodying the 

25 methods and apparatus herein. Kits of the invention optionally comprise one or more of the 
following; (1) a shuffled component as described herein; (?.) instructiotr fur practicing the 
methods ticsciibed herein, and'or for operating the sciccuau procedure herein; (3) one or 
niorc <issa> component, (4) a c-intciinei- for bokimi? -iuck'-e ids or enzymes, othci nucleic 
acids, transgenic plants, an;m lis, O.K.- ; - : i s-.e, i > - ,i. i r u\!tcn,ils, and (6) .software 

30 for performing any of the piot. ci,-, and'oi decision sicrps lu'^ ee iserein 

In A further aspect, the present invention provide*: for ihe use of an> 

component or kit herein, for the practice of any method or assay herein, and/or for the use of 

any apparatus or kit to practice any assay or method herein. 
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U is understood ilmi the examples and embodimeniri described herein are for 
illustrative purposes only and that various modifications or changes in iight thereof will be 
suggested to persons skilicd m (r-c a>i arid :irc lo bo meiiided nhin liie spirit and ijurview of 
this apiiiiofitton and scope of the. appended claims. Ail piibiicatioas, patents, and patent 
5 appheationy cited herein are hereby incorporated by refereace in their entirety for ali 
purposes. 
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CLAIMS 

What is claimed is: 

1 . A method of pppulating a data structure with a pluFdEty of character 
strings, said method comprising: 

5 i) encoding two or more a biologicai molecules into character strings 

to provide a collection of two or more differejit initial charactef strings wherein each of said 
biological moleetiles coiUprises at least about 10 subiimts; 

ii) selecting at least two substrings fR>m said character strings; 

iii) concatenating said substrings to fomi one or more product strings 
10 about the same length as one or more of the initial character strings; 

iv) adding the prodact .strings to a collection of sliiiigs; and 

v) optionally repeating steps (i) or (ii) through (iv) using one or more 
of said product strings as an initial string in the collection of initial character strings. 

2. The method of claim 1, wherein said encoding compriises encGditig 
1 5 one or more nucleic acid sequences into said character strings. 

3. The method of claim 2, wherein said one or more nucleic acid 
sequences comprise a nucleic acid sequence encoding a known protein. 

4. The method of claim 1 , wherein said encodiiig comprises encoding 
one or more amino acid sequences into said character strings. 

20 5, The metiiod of claim 4, wherein said one or more amino acid 

sequences comprise a nucleic acid sequence encoding a known protein. 

6, The method of claim I, wherein said biologjcal molecules iiave at 
least 30% sequence identity. 

7. The method of claim 1 , wherein said selecting comprises selecting 

25 substrings .such thai the ends of said substring<^ occur in string region*; of about 3 to ah<«it 20 
characters that have higher scqut-uce uicntny wnih int.- cotrespoiKlmi; reuion oftiinMhei of 
said initiai character strings than the overall sequence identity between the same two strmgs. 



-50. 



wo 06^0359 



PC17US0ft^ll38 



8. The method, of claim i , wherein said seleciing comprises selecting 
sitbstrings such that the ends of said substrings occur in predefined motifs of about 4 to about 
8 characters. 

9. The method of claim 1, wherein said selecting and concatenating 

comprises concatenating substrings from two different initiat strings such that the 
concatenation occurs in a region of about tliree to aboiit twenty characters having higher 
sequence identity between said two different imtial strings than tiie overall sequence identity 
between said two different initial strings. 

10. The method of claim 1 , wherein said selecting comprises aligning two 
or more of said initial character strings to maximixc pairwise identity between two or more 
substrings of the character stiings, and selecting a cliaracter that Is a ftieinbei- of an aligned 
pair for the end of one substring. 

1 1 . The method of claim 1 , wherein said product strings are added to the 
collection only if they have greater than 30% sequence identity with the initial strings. 

12. The method of claim 1 > wherein said method ftifther coniprises 
randomly altering one or more characters of said character strinp, 

1 3. The method of claim 1 2, wherein said method further comprises 
randomly selecting and altering one or tnore occun-ences of a particular preselected character 
in said character strings. 

14. A computer program product comprising computer code that 

i) encodes two or more a biological molecules into character strings to 
provide a collection of two or raoi^ different imtial character strings wherein each of said 
biological moiecuJes comprises at least about ten subunits; 

ii) selects at least two substrings from said character string; 

iii) concatenates said substrings to form one or more product strings 
about tlie same lengtli as one or more of the initial character strings; 

iv) adds the product strings to a collection of strings; v^td 

v) optionally repeats steps (i) or <ii) tlwugh (iv) using one or more of 
said product strings as an initial string in the collection of initial character strings. 
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1 5. The progrmn of ciami 1 4, wheroin said two or more biologicaJ 
molecuies are iiucieic acid sequences. 

1 6. The program of claim 14, wherein said two or more btologicai 
molecules are nucleic acid sequences of known proteins. 

5 i 7. The prograin of claim 1 4, wherein said two bv more biotogicaJ 

molecules are amino acid sequences 

1 8. The program of claim 14, wherein said biological molecules have at 

least 30% sequence identity. 

19. Tlic program of chum 14, wherein said code selects siibsirings such 
10 tliat the etids of said siibstriijgs occur in string regions of about three to about twenty 

characters that have higher sequence identity with the corresponding region of another of 
said initial character strings than the overall sequence identity between the same two strings, 

20. The program of claim 14, wherein said code selects substrings sucb 
tliat tlie ends of said substrings occur in predefmed motifs of about 4 to about 8 characters. 

15 2 1 . The program of claim 14, wherein said code selects and concatenates 

substrings from two different initial strings such that the concatenation occurs in a region of 
about three to about twenty c}mracter.s having higlrer sequence identity between said two 
ditlerent initial strings than the overail sequence identity between .said two different initial 
strings. 

20 22. The program of claim 14, whemin code selects substrings by aligning 

two or more of said initial character strings to maximize paitwise identity between two or 
more substrings of the character strings, and .selecting a character tiiat is a member of an 
aligned pair for the end of ome substring. 

23. The program of claim 14, wherein said product strings are added to 
25 the coUection only if they have greater than 30% identity with the initial .strings. 

24. The program of claim 14, wherein said method further comprises 
randomly altering one or more characters of stiid character strings, 
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25. The program of claim 24, wherein said nietiiod further comprises 
randomiy selecting and altering one or more occurrences of a particalar preselected character 
in said character strings. 

26. The prograra claim 14, wherein said code is stored on media selected 
5 ftom the group consisting of magnetic media, optica] media, optomagnetie tnedia. 

27. The program claim 14, wherein said code is ih dynamic or static 
memory of a computer. 

28. A label generating system for creating a plural ity of related labels, said 
iabeiing system comprising: 

1 0 an encoder for encoding two or more initial strings from biological 

molecules; 

an isolator for identifying and selecting stibstrings from said two or 

more strings; 

a concatcnator for concatenating said snfastr ings; 
15 a data structure for storing the concatenated substrings as a collection 

of strings; 

a comparator for measuring the number and variability of the 
collection of strings and determining that sufficient strings exist in the collection of strings; 
and 

20 a command writer for writing the collection of strings into a raw string 

file. 

29- The system of 28, wherein said isolator comprises a comparator for 
aligning and determining regions of identity between said two or more initial strings; 

30. The .system of 28, wherein said encoder comprises a means for 
25 encoding a nucleic acid sequence into a character string. 

3 i . The system of 28, whetein said encoder compmes a means for 
encoding an amino acid serijuence into a character string. 
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32, The system of claim 28, wScmn said eompacator cornprises a meaiVs 
for caicuktiog sequence identity, 

33, The system of claim 28, wherein said isolator selects substrings such 
that the eads of said substrings occur in string regions of about three to about 100 characters 

5 that have higher sequence ideal jt>' with the correspondiBg regioa of another of said initial 
character strings than the overall sequence identity between the same two strings. 

34, The system of claim 28, wherein said isolator selects substrings such 
that the ends of said substrings occur in predefined motifs of about 4 to about 8 characters. 

35 . The system of claim 28, wherein said isolator and concatenator 
i 0 indi vidualiy or in combination concatenate substrings from two different mitial strings such 
tliat the concatenation occurs in a region of about tliree to about 100 characters having higher 
sequeiice ideatity between said two different initisti strings than the overall sequence identity 
between said two different imtial strings. 

36. The system of claim 28, wherein said isolator aligns two or more of 
1 5 said initial character strings to maximize pairwise identity between two or more substrings of 
the character strings, and selecting a character that is a member of an aligned pair for the end 
of one substi'ing. 

37- The system of claim 28, "«\*efej» said comparator adds strings to said 
data structure only if they have greater than 30% identity with the initial strings. 

20 38. The system of claim 28, further comprising an operator to randomly 

altering one or more characters of the character strings, 

39 . The system of claim 38, wherein said operator i^ndomly selec ts and 
alters one or more occurrences of a particular preselected character in said charactc:f strings, 

40. The system of claim 28, wherein data structure is a data structure that 
25 stores encoded nucleic acid sequences. 

41. The system of claim 28, wherein data structure is a data structure that 
stores encoded amino acid sequences. 
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